Introduction

Temporomandibular disorders (TMDs) are defined as a set of musculoskeletal and neuromuscular conditions involving the masticatory musculature, the temporomandibular joint (TMJ), and/or surrounding tissues [1]. They are the second most common chronic musculoskeletal disorder, affecting approximately 31% of adults and 11% of children and adolescents [2]. According to the latest diagnostic criteria, TMDs comprise pain-related TMDs and intra-articular TMDs. Arthralgia, myalgia, local myalgia, myofascial pain with referral, and headache attributed to TMDs are common subgroups of pain-related TMDs, while disc displacement, degenerative joint disease, and subluxation are subgroups of intra-articular TMDs [3]. Symptoms of TMDs include joint pain, limitation or deviation of jaw movement, and joint sounds, which often affect the physical and psychological well-being of patients. With the rapid development of society, more and more individuals experience stress and anxiety, which may contribute to the progression of TMDs [4].

The diagnosis of conditions relating to TMDs still poses a significant challenge. Medical images contain abundant information relevant to diagnosis and treatment, and they are crucial in assisting dentists to diagnose TMDs. However, because medical images are difficult to interpret, or because of inexperience, even senior dentists often cannot make an immediate or correct diagnosis. For example, magnetic resonance imaging (MRI) shows the joint disc clearly but is not a routine image in oral examination, which means some dentists may not be trained to interpret temporomandibular joint MRI. Panoramic radiography is common in clinics because of its low radiation dose and low cost. However, the condyles are easily overlooked because of overlapping structures, and dentists may ignore the TMJ region because early condylar bone changes are difficult to observe.

Recently, the application of artificial intelligence (AI) to visual tasks, known as computer vision, has generated significant interest within the medical community [5]. Machine learning (ML), an important subset of AI, is widely used in medicine and can learn the mapping between features and outputs from large amounts of data to solve complex problems [6]. ML includes supervised and unsupervised learning: the former trains models on a labeled dataset, whereas the latter extracts features directly from unlabeled data. Traditional supervised ML algorithms include random forests, logistic regression, and support vector machines (SVM). Unsupervised ML includes principal component analysis and cluster analysis. Deep learning (DL) is a special type of ML constructed by simulating the connections between neurons in the human brain, and it has advantages in processing complex medical images. ML is therefore a plausible way to improve the diagnostic accuracy of TMDs. Currently, ML algorithms are widely used in dental radiology, and studies on the diagnosis of TMJ conditions using ML have gradually increased. Many studies have discussed the accuracy of ML algorithms in TMD diagnosis using different medical images. However, misdiagnosis can cause serious harm to patients, and the application of ML in clinical environments may raise ethical issues [7]. It is therefore very important to evaluate the accuracy of models in the diagnosis of TMDs. This paper reviews research on the ML-based diagnosis of TMDs across several types of medical images, and we hope it will benefit future research.

Methods

A protocol was registered online in PROSPERO (ID: CRD42023395128). The protocol and this review followed the PRISMA-DTA statement [8] and the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy.

Eligibility criteria

Inclusion criteria were as follows:

  1. Population: Temporomandibular disorder patients and/or healthy participants.

  2. Index tests: Any diagnostic test based on machine learning (including deep learning).

  3. Reference standard: Clinical diagnosis by physicians.

  4. Target condition: Pain-related TMDs and intra-articular TMDs, including arthralgia, myalgia, local myalgia, myofascial pain with referral, disc displacement, and degenerative joint disease of the temporomandibular joints.

  5. Study design: Single diagnostic test accuracy (SDTA) studies and comparative diagnostic test accuracy (CDTA) studies.

Exclusion criteria were as follows:

  1. Intervention studies, etiological studies, or prognostic studies;

  2. Reports focused on other clinical questions;

  3. Ongoing studies; and

  4. Reports without eligible outcomes (i.e., only abstracts or protocols were published).

Search methods

Twelve databases were searched up to 19 July 2023 for published and unpublished reports: (1) Europe PubMed Central (Europe PMC), (2) Embase via Ovid, (3) Evidence-Based Medicine Reviews (EBM Reviews) via Ovid, (4) Scopus, (5) Web of Science Core Collection (WOSCC), (6) Information Service in Physics, Electro-Technology and Computer and Control (Inspec), (7) Korea Citation Index (KCI), (8) Scientific Electronic Library Online (SciELO), (9) WHO Global Index Medicus (GIM), (10) arXiv.org, (11) Open Science Framework Preprints (OSF Preprints), and (12) IEEE Xplore. In addition, two registry platforms were searched up to 19 July 2023 for registered clinical trials: (1) WHO International Clinical Trials Registry Platform (ICTRP), and (2) ClinicalTrials.gov. There were no restrictions on language or publication date. Furthermore, a cited reference search was conducted based on the included studies. More details about the search strategies are available on pages 5 to 12 of the Online Resource.

Selection and data collection of studies

Two review authors independently screened the titles and abstracts of all retrieved records, then obtained and assessed the full reports of all studies that appeared to meet the inclusion criteria. Any disagreements were resolved by discussion or by involving another review author as an arbiter.

Two review authors extracted data from included studies independently and resolved their disagreements by discussion or the involvement of another review author as an arbiter. The following characteristics were extracted from included studies: methods (study designs, periods, locations, setting, and funding sources), participants (inclusion and exclusion criteria, number of patients, gender, and age), index tests (machine learning models), reference standards, outcomes, and other information.

Assessment of methodological quality in included studies

Two review authors independently assessed risk of bias (internal validity) and applicability (external validity) in the included studies and resolved their disagreements by discussion or by involving another review author as an arbiter. The methodological quality of SDTA studies was assessed with QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) [9]. The signalling questions used in the present study are reported on page 63 of the Online Resource.

Effect measures

For each included study, we calculated sensitivity and specificity with 95% confidence intervals (CI) at the individual test level for the thresholds of interest. The primary thresholds of interest were those reported in the original included studies.
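The calculation described above can be sketched from a study's 2×2 table. This is only an illustration under stated assumptions: the review does not say which interval method was used, so the Wilson score interval is assumed here, and the function names and the example counts are hypothetical rather than taken from any included study.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


def sensitivity_specificity(tp, fp, fn, tn):
    """Return (estimate, ci_low, ci_high) for sensitivity and specificity."""
    sens = tp / (tp + fn)  # diseased cases correctly flagged by the index test
    spec = tn / (tn + fp)  # healthy cases correctly cleared by the index test
    return {
        "sensitivity": (sens, *wilson_interval(tp, tp + fn)),
        "specificity": (spec, *wilson_interval(tn, tn + fp)),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, (est, low, high) in sensitivity_specificity(40, 5, 10, 45).items():
        print(f"{name}: {est:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Reporting the four cell counts (TP, FP, FN, TN) alongside the derived measures lets later reviews recompute these values.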

Synthesis methods

If no more than four studies were included at the individual index test level, we undertook meta-analyses to pool the sensitivity and specificity using a random-effects model. These data were visualized using forest plots of sensitivity and specificity.
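One way to pool a proportion such as sensitivity under a random-effects model is sketched below. The review does not name the estimator or software it used, so the DerSimonian-Laird estimator on the logit scale, the 0.5 continuity correction, and the function name `pool_logit_random_effects` are all assumptions; the example counts are hypothetical.

```python
import math


def pool_logit_random_effects(events, totals, z=1.96):
    """Pool proportions on the logit scale (DerSimonian-Laird, assumed)."""
    ys, vs = [], []
    for e, n in zip(events, totals):
        a, b = e + 0.5, (n - e) + 0.5      # continuity-corrected counts
        ys.append(math.log(a / b))         # logit of the study proportion
        vs.append(1.0 / a + 1.0 / b)       # approximate within-study variance
    w = [1.0 / v for v in vs]
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))  # Cochran's Q
    df = len(ys) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in vs]
    y_re = sum(wi * yi for wi, yi in zip(w_re, ys)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def inv_logit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "pooled": inv_logit(y_re),
        "ci_low": inv_logit(y_re - z * se),
        "ci_high": inv_logit(y_re + z * se),
        "i2": i2,
        "tau2": tau2,
    }


if __name__ == "__main__":
    # Hypothetical true-positive counts and diseased totals from three studies.
    print(pool_logit_random_effects([30, 28, 35], [40, 38, 44]))
```

Sensitivity and specificity are pooled separately in this sketch, which ignores their correlation across studies; a bivariate model would account for it but generally needs more studies to fit.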

Assessment of reporting bias

If more than ten studies had been included in the same synthesis, publication bias would have been assessed by visually inspecting a funnel plot for asymmetry.
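For illustration, the coordinates of one study on such a funnel plot could be computed as below. The review does not specify the plot's axes, so the log diagnostic odds ratio against its standard error is assumed here, and the function name `funnel_point` and the 0.5 continuity correction are hypothetical choices.

```python
import math


def funnel_point(tp, fp, fn, tn):
    """Return (ln DOR, standard error) for one study's 2x2 table."""
    # Add 0.5 to every cell so that zero cells do not break the logarithm.
    a, b, c, d = (x + 0.5 for x in (tp, fp, fn, tn))
    ln_dor = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return ln_dor, se


if __name__ == "__main__":
    # Hypothetical counts; larger studies give smaller standard errors.
    print(funnel_point(40, 5, 10, 45))
```

Plotting these points for all studies in a synthesis, with the standard error on an inverted vertical axis, gives the funnel whose symmetry is inspected.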

Results

Selection of studies

As shown in Fig. 1, a total of 1660 records were retrieved from twelve databases and two registry platforms. After 386 duplicates were removed, the titles and abstracts of the remaining 1274 records were screened on EndNote Desktop and online. Following the full-text review, 23 reports were excluded and 29 reports remained. In total, 28 studies (29 reports) were included in this systematic review. Details about the search strategies and the selection process are available on pages 3–12 of the Online Resource.

Fig. 1 PRISMA 2020 flow diagram for this review

Characteristics and risk-of-bias of included studies

Study characteristics

Characteristics of individual studies are summarized in Table 1. These studies used different types of medical images, including magnetic resonance imaging (MRI, n = 8), panoramic radiographs (n = 4), cone-beam computed tomography (CBCT, n = 11), and other image modalities (n = 5). The included studies focused on various diseases: temporomandibular joint degenerative joint disease (DJD, n = 16), disc displacement (DD, n = 6), joint perforation (n = 1), joint osteoporosis (n = 1), arthropathy and myopathy (n = 1), and TMDs (n = 4). Many studies used more than one algorithm: deep learning (n = 15), k-nearest neighbor (KNN, n = 4), support vector machine (SVM, n = 7), random forest (n = 7), logistic regression (n = 3), naïve Bayesian (n = 2), extreme gradient boosting (n = 3), light gradient boosting machine (LightGBM, n = 2), and multiple regression analysis (MLP, n = 3). Deep learning models included ResNet-152, Yolo v5, EfficientNet-B7, Inception ResNetV2, Inception V3, VGG-16, DenseNet-169, single-shot detector (SSD), Xception, ResNet-101, MobileNetV2, DenseNet-121, ConvNeXt, learning using privileged information (LUPI), TensorFlow, and so on.

Table 1 Characteristics of individual studies

Risk of bias of included studies

The results of the quality assessment of the included studies are presented in Figs. 2 and 3. No study had a low risk of bias in all domains, and the “Patient Selection” and “Index Test” domains were the most affected. Twenty-five studies (89.29%) were at high risk in the “Index Test” domain because they did not perform any robustness or sensitivity analysis of their models. Details about the characteristics of each study and the answers to the signalling questions are available on pages 13–62 of the Online Resource.

Fig. 2 Risk of bias and applicability concerns graph of included studies

Fig. 3 Risk of bias and applicability concerns summary of included studies

Effects of diagnostic tests

Results of individual studies

In the classification of DJD, the highest specificity was 94% using fine-tuned VGG16 [22]. The highest sensitivity was 100% using SVM, random forest, logistic regression [29] and Yolo v5 [16].

In the classification of DD, the highest sensitivity was 100% using Inception v3 [21], and the highest specificity was 91.8% using ANN [31].

Results of syntheses

If a study reported several models, we selected the best result for the analyses according to specificity, sensitivity, and accuracy. The results for the three pooled models are shown in Figs. 4, 5, and 6. The comparisons of diagnostic accuracy across the included studies were:

  • Group 1: Diagnosis of DJD with CBCT using random forest [10, 18, 24].

Little statistical heterogeneity was found among the included studies. The pooled sensitivity and specificity were 0.745 (0.660–0.814, I² = 47.67%, P = 0.125) and 0.770 (0.700–0.828, I² = 33.86%, P = 0.209), respectively.

  • Group 2: Diagnosis of DJD with CBCT using XGBoost.

No statistical heterogeneity was found, and the pooled sensitivity and specificity were 0.765 (0.686–0.829, I² = 0%, P = 0.467) and 0.766 (0.688–0.830, I² = 0%, P = 0.592), respectively.

  • Group 3: Diagnosis of DJD with CBCT using LightGBM [10, 24, 27].

No statistical heterogeneity was found, and the pooled sensitivity and specificity were 0.781 (0.704–0.843, I² = 0%, P = 0.683) and 0.781 (0.704–0.843, I² = 0%, P = 0.683), respectively.

Fig. 4 Pooled sensitivity and specificity of random forest

Fig. 5 Pooled sensitivity and specificity of XGBoost

Fig. 6 Pooled sensitivity and specificity of LightGBM

We did not perform analyses of other algorithms because the corresponding studies targeted different types of medical images and diseases.

All evidence was graded as very low because of imprecision and high risk of bias. Thus, it remains uncertain which model performs better in the diagnosis of TMDs.

Discussion

TMDs are common musculoskeletal disorders characterized by joint pain, joint sounds, and degenerative changes, which can affect patients' lives, cause psychological problems, and increase healthcare costs. However, accurate diagnosis of TMDs is a challenging task for general dentists and physicians. According to recent research, machine learning, especially deep learning, may offer advantages over clinicians [20].

To our knowledge, the present study is the most comprehensive meta-analysis of the performance of machine learning algorithms in TMDs. We searched twelve databases and two online platforms, retrieved 1,660 records in total, and finally undertook three meta-analyses, of random forest, LightGBM, and XGBoost. The pooled results were barely satisfactory, although some models, such as Yolo v5 and Inception V3, reached 100% sensitivity in individual studies, which is promising for the diagnosis of TMDs. We also assessed the risk of bias and applicability of the included studies and found that some studies had a high risk of bias, while most raised low concerns about applicability.

Interpretation of the results

  • Group 1: Diagnosis of DJD with CBCT using random forest.

The pooled specificity of random forest was barely satisfactory (> 0.75). However, random forest was not the best model in Banchi 2020, Cai 2023, Haghnegahdar 2022, or Le 2021, and the sensitivity of SVM in Cai 2023 reached 0.86. Therefore, random forest is not recommended for diagnosing TMJ DJD with CBCT.

  • Group 2: Diagnosis of DJD with CBCT using XGBoost & Group 3: diagnosis of DJD with CBCT using LightGBM.

These three studies were notably similar in study design, namely in sample size (92 patients), image type (CBCT), and target condition (osteoarthritis). Le 2021 and Mackie 2022 shared the same ethics approval, while Banchi 2020 and Mackie 2022 were supported by the same funding. It is therefore possible that they used the same dataset. If so, Banchi 2020 cannot be considered to have enrolled patients consecutively, even though the authors stated that “we enrolled patients and subjects from January 2016 to December 2018”. As for the reference standard, Le 2021 might be reliable, judging by the reference standard described in Banchi 2020 and Mackie 2022.

XGBoost was not the best model in Banchi 2020, Haghnegahdar 2022, or Le 2021. LightGBM, and the combination of LightGBM and XGBoost, achieved more than 80% diagnostic accuracy in Banchi 2020 and Mackie 2022, which might be promising for diagnosing DJD.

Several studies compared the accuracy of TMD experts and AI. Choi et al. [13] reported that experts performed better than AI regardless of which trial the patients with indeterminate DJD were assigned to. Jung et al. [16] set up three groups (AI, TMD specialists, and general dentists) and showed that the accuracy of AI was higher than that of specialists and general dentists. In another article, AI obtained the highest specificity and an expert obtained the highest sensitivity. Overall, it is not clear whether experts or models are more accurate, but AI models might assist clinicians in the diagnosis of TMDs.

In recent years, deep learning (DL) models have performed better than conventional machine learning methods in most computer vision tasks [30], such as classification, segmentation, and detection [38,39,40]. In this review, 11 studies used 9 convolutional neural network models with different medical images, and the reported results were highly satisfactory; most of them seemed to meet the requirements of clinical diagnosis. Image segmentation is also important for diagnosing diseases and assessing the quality of plans. Auto-segmentation techniques have been grouped into three generations of algorithms, with multiatlas-based and hybrid techniques considered the state of the art [41]. Obtaining regions of interest (ROI) from the original medical images helps segmentation tasks because it removes needless information such as air and artifacts. Kao et al. [17] introduced U-Net to detect the ROI in MRI slices and improved the performance of models in the diagnosis of TMDs.

However, the application of convolutional neural networks (CNNs) in the clinical environment still has a long way to go. The use of CNNs must be subject to ethical restrictions and strict review because deep learning remains a black box. Only a few studies used Grad-CAM to provide explainable information about the key points on which models based their diagnoses. One study showed that CAM highlights the discriminative object parts detected by the deep learning model [17]. Jung et al. reported that even when the classification accuracy of two neural networks was similar, a difference in explanatory power could be observed using Grad-CAM [20]. In clinical decision-making, both dentists and patients require accurate diagnostic results and visual information. In general, more high-quality evidence on model interpretability is needed.

In addition to those image modalities, ultrasound is also widely used in oral and maxillofacial surgery, for example to evaluate glandular tumors and metastatic lymph nodes. Eida et al. trained a model on ultrasound images of metastatic lymph nodes that raised the performance of residents to the level of experienced radiologists [42].

Risk of bias and concerns regarding applicability

We assessed the risk of bias and applicability of the included studies according to QUADAS-2. Overall, most of the included studies were at high risk of bias.

In terms of Patient Selection, some studies did not enroll random or consecutive samples of patients, such as Le 2021 [24], a case-control study. Data imbalance was judged strictly because predictions are highly uncertain when training sets are imbalanced. A model also becomes more valuable for clinical decisions as the distribution of participants gets closer to that of the population [43]. Mackie 2022 [27] excluded patients with severe disease lasting more than 10 years and was therefore rated at high risk.

As for the Index Test domain, only six studies stated thresholds. Seven did not provide enough information about the models they used, and studies that merely listed the type and structure of the model could not be judged as low risk. More details, including hyperparameters, loss functions, and activation functions, must be provided. Researchers usually divided the sample into training, validation, and test sets for internal validation, but few studies conducted external validation, which is decisive for the final performance in a real clinical environment [43].
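The internal validation split mentioned above can be sketched as follows. This is a minimal illustration, not the procedure of any included study: the function name `split_by_patient`, the 70/15/15 fractions, and the `(patient_id, item)` record layout are assumptions. Splitting at the patient level is a design choice assumed here so that images from one patient never appear in both the training and the test set.

```python
import random
from collections import defaultdict


def split_by_patient(records, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split (patient_id, item) records into train/validation/test sets.

    All items from one patient land in the same set to avoid leakage.
    """
    by_patient = defaultdict(list)
    for patient_id, item in records:
        by_patient[patient_id].append(item)

    patients = sorted(by_patient)            # deterministic starting order
    random.Random(seed).shuffle(patients)    # reproducible shuffle

    n_train = int(len(patients) * fractions[0])
    n_val = int(len(patients) * fractions[1])
    groups = {
        "train": patients[:n_train],
        "validation": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],  # remainder goes to the test set
    }
    return {
        name: [(p, item) for p in ids for item in by_patient[p]]
        for name, ids in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical dataset: 20 patients with two images each.
    data = [(f"p{i}", f"img{i}_{j}") for i in range(20) for j in range(2)]
    for name, subset in split_by_patient(data).items():
        print(name, len(subset))
```

External validation would additionally require data collected at a separate center; no split of a single-center dataset can substitute for it.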

Regarding the Reference Standard, four studies did not state the reference standard they used, and three did not use a gold standard. Larheim and colleagues suggested that CBCT and CT were reliable for the diagnosis of DJD because they visualize the condyles more clearly [3, 44]. However, Kim 2020 [22] explored the diagnostic accuracy of DJD on panoramic images and took the annotations of two physicians on panoramic images as the gold standard. Meanwhile, some studies relied on the interpretation of a single specialist; cross-center validation and dental work experience have been suggested to reduce annotator bias [45].

Limitations of this review

There were some limitations in the included studies. First, the diagnostic accuracy of models needs to be improved, especially sensitivity. Choi et al. reported that Trial 1 and Trial 2, which included images with indeterminate DJD, could not obtain satisfactory sensitivity or specificity, which meant that the ability of models to diagnose complex patients was not sufficient [13]. Some researchers reported that the sensitivity of diagnosing DJD with panoramic images was approximately 0.5 lower than with CT, and that the destruction of bone tissue takes time to become visible on panoramic images. However, considering the convenience of examination in clinics and the lack of TMD specialists, panoramic imaging is still recommended as a vital means of early diagnosis [27]. Second, the image modalities, models, and target conditions of the included studies varied. It is meaningless to synthesize diagnostic results blindly across different participants, conditions, and image modalities. Therefore, the synthesis results in this study were limited, and only a few models were evaluated in the diagnosis of DJD. Furthermore, a few studies did not provide original research data, such as TN, TP, FN, and FP, which prevented this review from exploring the diagnostic accuracy of more models. Third, most of the datasets of the included studies came from a single center. It is well known that data from various sources represent more diverse patients and images and increase the robustness of models, but correctly interpreted data are mostly labeled by humans, which takes considerable time and resources. A public database should be established for researchers to share labeled data and train models.

In general, ML is promising for the accurate diagnosis of TMDs. In the future, more experiments should be carried out in clinical environments, and more attention should be paid to 3D reconstruction and the design of treatment plans. At the same time, given the ethical and moral constraints surrounding the right to privacy, more policies should be formulated to protect patients’ privacy and medical security.

Conclusions

The present systematic review and meta-analysis showed that some types of machine learning algorithms might be satisfactory in the diagnosis of DJD and that deep learning may be a promising tool. However, most studies had a high risk of bias in study design, and some datasets were too small to yield representative results. Further prospective clinical studies are recommended, with sound designs for Patient Selection, Index Test, and Reference Standard and with more accurate and precise ML methods. We hope that models suitable for use in a real clinical environment will be developed in the future.