Introduction

Temporomandibular disorders (TMDs) are defined as a set of musculoskeletal and neuromuscular conditions involving the masticatory musculature, the temporomandibular joint (TMJ), and/or surrounding tissues [1]. They are the second most common chronic musculoskeletal disorder, affecting approximately 31% of adults and 11% of children and adolescents [2]. According to the latest diagnostic criteria, TMDs comprise pain-related TMDs and intra-articular TMDs. Arthralgia, myalgia, local myalgia, myofascial pain with referral, and headache attributed to TMDs are common subgroups of pain-related TMDs, while disc displacement, degenerative joint disease, and subluxation are subgroups of intra-articular TMDs [3]. TMDs encompass symptoms such as joint pain, limitation or deviation of jaw movement, and joint sounds, which usually affect the physical and psychological well-being of patients. With the rapid development of society, more and more individuals live under stress and anxiety, which can contribute to the progression of TMDs [4].

The diagnosis of conditions relating to TMDs still poses a significant challenge. Medical images contain abundant information for diagnosis and treatment and are crucial in helping dentists diagnose TMDs. However, because medical images are difficult to interpret, or because of inexperience, dentists, even senior ones, often cannot make an immediate or correct diagnosis. For example, magnetic resonance imaging (MRI) shows the joint discs clearly but is not a routine image in oral examination, which means that some dentists may not be trained to interpret temporomandibular joint MRI. Panoramic radiography is common in clinics because of its low radiation dose and low cost. However, the condyles are easily overlooked because of overlapping structures, and dentists may ignore the TMJ area because early condylar bone change is difficult to observe.

Recently, the application of artificial intelligence (AI) to visual tasks, known as computer vision, has generated significant interest within the medical community [5]. Machine learning (ML), an important subset of AI, is widely used in medicine and can learn the mapping between features and outputs from large amounts of data to solve complex problems [6]. ML includes supervised and unsupervised learning: the former trains models on a labeled dataset, whereas the latter extracts features directly from unlabeled data. Traditional supervised ML algorithms include random forests, logistic regression, and support vector machines (SVMs), among others. Unsupervised ML includes principal component analysis and cluster analysis. Deep learning (DL) is a special type of ML constructed by simulating the connections between neurons in the human brain, and it has advantages in processing complex medical images. ML is therefore a promising way to improve the diagnostic accuracy of TMDs. ML algorithms are now widely used in dental radiology, and studies on ML-based diagnosis of TMJ conditions have gradually increased. Many studies have discussed the accuracy of ML algorithms in TMD diagnosis using different medical images. However, misdiagnosis can cause serious harm to patients, and the application of machine learning in clinical environments may raise ethical issues [7]. It is therefore very important to evaluate the accuracy of models in the diagnosis of TMDs. This paper reviews research on the ML-based diagnosis of TMDs across several types of medical images, and we hope it will benefit future research.

Methods

A protocol was registered online in PROSPERO (ID: CRD42023395128). The protocol and this review followed the PRISMA-DTA statement [8] and the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy.

Eligibility criteria

Inclusion criteria were as follows:

  1. Population: Temporomandibular disorder patients and/or healthy participants.
  2. Index tests: Any diagnostic test based on machine learning (including deep learning).
  3. Reference standard: Clinical diagnosis by physicians.
  4. Target condition: Pain-related TMDs and intra-articular TMDs, including arthralgia, myalgia, local myalgia, myofascial pain with referral, disc displacement, and degenerative joint disease of the temporomandibular joints.
  5. Study design: Single diagnostic test accuracy (SDTA) studies and comparative diagnostic test accuracy (CDTA) studies.

Exclusion criteria were as follows:

  1. Intervention studies, etiological studies, or prognostic studies;
  2. Reports focused on other clinical questions;
  3. Ongoing studies; and
  4. Reports without eligible outcomes (i.e., only abstracts or protocols were published).

Search methods

Twelve databases were searched up to 19 July 2023 for published and unpublished reports: (1) Europe PubMed Central (Europe PMC), (2) Embase via Ovid, (3) Evidence-Based Medicine Reviews (EBM Reviews) via Ovid, (4) Scopus, (5) Web of Science Core Collection (WOSCC), (6) Information Service in Physics, Electro-Technology and Computer and Control (Inspec), (7) Korea Citation Index (KCI), (8) Scientific Electronic Library Online (SciELO), (9) WHO Global Index Medicus (GIM), (10), (11) Open Science Framework Preprints (OSF Preprints), and (12) IEEE Xplore. In addition, two registry platforms were searched up to 19 July 2023 for registered clinical trials: (1) WHO International Clinical Trials Registry Platform (ICTRP), and (2). There were no restrictions on language or publication date. Furthermore, a cited reference search was conducted based on the included studies. More details about the search strategies are available on pages 5 to 12 of the Online Resource.

Selection and data collection of studies

Two review authors independently screened the titles and abstracts of each retrieved record and then obtained and assessed the full reports of all studies that appeared to meet the inclusion criteria. Any disagreements were resolved by discussion or by involving another review author as an arbiter.

Two review authors extracted data from included studies independently and resolved their disagreements by discussion or the involvement of another review author as an arbiter. The following characteristics were extracted from included studies: methods (study designs, periods, locations, setting, and funding sources), participants (inclusion and exclusion criteria, number of patients, gender, and age), index tests (machine learning models), reference standards, outcomes, and other information.

Assessment of methodological quality in included studies

Two review authors independently assessed the risk of bias (internal validity) and applicability (external validity) of the included studies and resolved their disagreements by discussion or by involving another review author as an arbiter. The methodological quality of SDTA studies was assessed with QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) [9]. The signalling questions used in the present study are reported on page 63 of the Online Resource.

Effect measures

For each included study, sensitivity and specificity with 95% confidence intervals (CIs) were calculated at the individual test level for the thresholds of interest. The primary thresholds of interest were those reported in the original included studies.
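As an illustration of these effect measures, the following minimal sketch computes sensitivity and specificity with 95% confidence intervals from the four cells of a 2×2 table. The Wilson score interval is an assumption made for the example, since the review does not state which interval method was used; the function names and the counts in the usage line are hypothetical.

```python
import math


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


def diagnostic_accuracy(tp, fp, fn, tn):
    """Return (estimate, ci_low, ci_high) for sensitivity and specificity."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one diseased and one non-diseased case")
    sensitivity = tp / (tp + fn)  # proportion of diseased cases detected
    specificity = tn / (tn + fp)  # proportion of healthy cases cleared
    return {
        "sensitivity": (sensitivity, *wilson_ci(tp, tp + fn)),
        "specificity": (specificity, *wilson_ci(tn, tn + fp)),
    }


# Hypothetical counts, not taken from any included study.
print(diagnostic_accuracy(tp=40, fp=10, fn=10, tn=40))
```

The Wilson interval is chosen here only because it behaves sensibly when a proportion is close to 0 or 1, which occurs when a model reports 100% sensitivity; other interval methods would serve equally well for the illustration.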

Synthesis methods

If no more than four studies were included at the individual index test level, meta-analyses were undertaken to pool the sensitivity and specificity using a random-effects model. The pooled data were visualized using forest plots of sensitivity and specificity.
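To make the pooling step concrete, the sketch below pools study-level proportions (for example, sensitivities) with a random-effects model and reports Cochran's Q and I². It assumes a DerSimonian-Laird estimator on the logit scale with a 0.5 continuity correction; the review does not state which estimator or software was used, so this is one common choice rather than the authors' procedure. The function name and the counts in the usage line are hypothetical.

```python
import math


def pool_proportions_random_effects(events, totals, z=1.96):
    """DerSimonian-Laird random-effects pooling of proportions on the logit scale."""
    if len(events) != len(totals) or len(events) < 2:
        raise ValueError("need at least two studies with matching counts")
    y, v = [], []
    for e, n in zip(events, totals):
        if e == 0 or e == n:
            # Continuity correction so the logit and its variance are finite.
            e, n = e + 0.5, n + 1.0
        y.append(math.log(e / (n - e)))    # logit of the proportion
        v.append(1.0 / e + 1.0 / (n - e))  # approximate variance of the logit
    w = [1.0 / vi for vi in v]             # fixed-effect (inverse-variance) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(y) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)          # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_re = [1.0 / (vi + tau2) for vi in v]  # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))

    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))

    return {
        "pooled": expit(mu),
        "ci_low": expit(mu - z * se),
        "ci_high": expit(mu + z * se),
        "tau2": tau2,
        "q": q,
        "i2": i2,
    }


# Hypothetical true-positive counts and diseased totals from three studies.
print(pool_proportions_random_effects(events=[30, 35, 28], totals=[40, 46, 40]))
```

Sensitivity and specificity are pooled separately in this sketch, which matches the univariate approach described above; a bivariate model that accounts for their correlation would need more studies than were available in each group.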

Assessment of reporting bias

If there had been more than ten studies in the same synthesis, publication bias would have been assessed by visually inspecting a funnel plot for asymmetry.

Results

Selection of studies

As shown in Fig. 1, a total of 1660 records were retrieved from twelve databases and two registry platforms. After 386 duplicates were removed, the titles and abstracts of the remaining 1274 records were screened in EndNote Desktop and online. Following full-text review, 23 reports were excluded and 29 reports were retained. In total, 28 studies (29 reports) were included in this systematic review. Details about the search strategies and the selection process are available on pages 3–12 of the Online Resource.

Fig. 1 PRISMA 2020 flow diagram for this review

Characteristics and risk-of-bias of included studies

Study characteristics

The characteristics of individual studies are summarized in Table 1. These studies used different types of medical images, including magnetic resonance imaging (MRI, n = 8), panoramic radiographs (n = 4), cone-beam computed tomography (CBCT, n = 11), and other image modalities (n = 5). The included studies focused on various diseases: temporomandibular joint degenerative joint disease (DJD, n = 16), disc displacement (DD, n = 6), joint perforation (n = 1), joint osteoporosis (n = 1), arthropathy and myopathy (n = 1), and TMDs (n = 4). Many studies used more than one algorithm: deep learning (n = 15), k-nearest neighbor (KNN, n = 4), support vector machine (SVM, n = 7), random forest (n = 7), logistic regression (n = 3), naïve Bayesian (n = 2), extreme gradient boosting (XGBoost, n = 3), light gradient boosting machine (LightGBM, n = 2), and multilayer perceptron (MLP, n = 3). Deep learning models included ResNet-152, Yolo v5, EfficientNet-B7, Inception ResNetV2, Inception V3, VGG-16, DenseNet-169, single-shot detector (SSD), Xception, ResNet-101, MobileNetV2, DenseNet-121, ConvNeXt, learning using privileged information (LUPI), TensorFlow, and so on.

Table 1 Characteristics of individual studies

Risk of bias of included studies

The results of the quality assessment of the included studies are presented in Figs. 2 and 3. No study had a low risk of bias in all domains, and the “Patient Selection” and “Index Test” domains were the most affected. Twenty-five studies (89.29%) were at high risk of bias in the “Index Test” domain because they did not perform any robustness or sensitivity analysis of their models. Details about the characteristics of each study and the answers to the signalling questions are available on pages 13–62 of the Online Resource.

Fig. 2 Risk of bias and applicability concerns graph of included studies

Fig. 3 Risk of bias and applicability concerns summary of included studies

Effects of diagnostic tests

Results of individual studies

In the classification of DJD, the highest specificity was 94% using fine-tuned VGG16 [22]. The highest sensitivity was 100% using SVM, random forest, logistic regression [29] and Yolo v5 [16].

In the classification of DD, the highest sensitivity was 100% using Inception v3 [21], and the highest specificity was 91.8% using an artificial neural network (ANN) [31].

Results of syntheses

If a study reported several models, we selected the best result for the analyses according to specificity, sensitivity, and accuracy. The results for the three pooled models are shown in Figs. 4, 5, and 6. The comparisons of diagnostic accuracy across the included studies were:

  • Group 1: Diagnosis of DJD with CBCT using random forest [10, 18, 24].

Some statistical heterogeneity was found among the included studies. The pooled sensitivity was 0.745 (95% CI 0.660–0.814, I² = 47.67%, P = 0.125), and the pooled specificity was 0.770 (95% CI 0.700–0.828, I² = 33.86%, P = 0.209).

  • Group 2: Diagnosis of DJD with CBCT using XGBoost.

No statistical heterogeneity was found. The pooled sensitivity was 0.765 (95% CI 0.686–0.829, I² = 0%, P = 0.467), and the pooled specificity was 0.766 (95% CI 0.688–0.830, I² = 0%, P = 0.592).

  • Group 3: Diagnosis of DJD with CBCT using LightGBM [10, 24, 27].

No statistical heterogeneity was found. The pooled sensitivity was 0.781 (95% CI 0.704–0.843, I² = 0%, P = 0.683), and the pooled specificity was 0.781 (95% CI 0.704–0.843, I² = 0%, P = 0.683).

Fig. 4 Pooled sensitivity and specificity of random forest

Fig. 5 Pooled sensitivity and specificity of XGBoost

Fig. 6 Pooled sensitivity and specificity of LightGBM

We did not perform meta-analyses for other algorithms because the relevant studies targeted different types of medical images and diseases.

The certainty of all evidence was graded as very low because of imprecision and high risk of bias. Thus, it remains uncertain which model performs better in the diagnosis of TMDs.

Discussion

TMDs are common musculoskeletal diseases characterized by joint pain, joint sounds, and degenerative changes, which can affect patients' lives, cause psychological problems, and increase healthcare costs. However, accurate diagnosis of TMDs is a challenging task for general dentists and physicians. According to recent research, machine learning, especially deep learning, has shown advantages over clinicians [20].

To our knowledge, the present study is the most comprehensive meta-analysis of the performance of machine learning algorithms in TMDs. We searched twelve databases and two online platforms, retrieved 1,660 records in total, and finally undertook three meta-analyses, for random forest, LightGBM, and XGBoost. The pooled results were barely satisfactory, although some models, such as Yolo v5 and Inception V3, reached 100% sensitivity, which is promising for the diagnosis of TMDs. We also assessed the risk of bias and applicability of the included studies and found that some studies had a high risk of bias, while most had low applicability concerns.

Interpretation of the results

  • Group 1: Diagnosis of DJD with CBCT using random forest.

The pooled specificity of random forest was barely satisfactory (> 0.75). However, random forest was not the best model in Banchi 2020, Cai 2023, Haghnegahdar 2022, or Le 2021, and the sensitivity of SVM in Cai 2023 reached 0.86. Therefore, random forest is not recommended for diagnosing TMJ DJD with CBCT.

  • Group 2: Diagnosis of DJD with CBCT using XGBoost, and Group 3: Diagnosis of DJD with CBCT using LightGBM.

These three studies were notably similar in study design, sharing the same sample size (92 patients), image type (CBCT), and target condition (osteoarthritis). Le 2021 and Mackie 2022 shared the same ethics approval, while Banchi 2020 and Mackie 2022 were supported by the same funding. It is therefore possible that they used the same dataset. If so, patient enrollment in Banchi 2020 cannot be considered consecutive, even though the authors stated that “we enrolled patients and subjects from January 2016 to December 2018”. Judged against the reference standard described in Banchi 2020 and Mackie 2022, the reference standard of Le 2021 might be considered reliable.

XGBoost was not the best model in Banchi 2020, Haghnegahdar 2022, or Le 2021. LightGBM and the combination of LightGBM and XGBoost in Banchi 2021 and Mackie 2022 achieved a diagnostic accuracy of more than 80%, which might be promising for diagnosing DJD.

Several studies have compared the accuracy of TMD experts and AI. Choi et al. [13] reported that experts performed better than AI regardless of which trial the patients with indeterminate DJD were assigned to. Jung et al. [16] set up three groups (AI, TMD specialists, and general dentists) and showed that the accuracy of AI was higher than that of the specialists and general dentists. In another article, AI obtained the highest specificity and an expert obtained the highest sensitivity. Overall, it is not clear whether experts or models are more accurate, but AI models might assist clinicians in the diagnosis of TMDs.

In recent years, deep learning (DL) models have performed better than conventional machine learning methods in most computer vision tasks [30], such as classification, segmentation, and detection [38, 39, 40]. In this review, 11 studies used 9 convolutional neural network models with different medical images, and the results were extremely satisfactory. Most of them seemed to meet the requirements of clinical diagnosis. Image segmentation is also important for diagnosing diseases and assessing the quality of plans. Auto-segmentation techniques have been clustered into three generations of algorithms, with multi-atlas-based and hybrid techniques considered the state of the art [41]. Segmentation helps to obtain regions of interest (ROIs) from the original medical images, which reduces needless information such as air and artifacts. Kao et al. [17] introduced U-Net to detect ROIs in MRI slices and improved the performance of models in the diagnosis of TMDs.

However, the application of convolutional neural networks (CNNs) in clinical environments still has a long way to go. CNNs must be used under ethical restrictions and strict review because the inner workings of deep learning remain a black box. Only a few studies used Grad-CAM to provide explainable information about the key points underlying a model's diagnosis. One study showed that CAM highlights the discriminative object parts detected by the deep learning model [17]. Jung et al. reported that even when the classification accuracy of two neural networks was similar, a difference in explanatory power could be observed using Grad-CAM [20]. In clinical decision-making, both dentists and patients require accurate diagnostic results and visual information. In general, more high-quality evidence about model interpretability is needed.

In addition to those image modalities, ultrasound is widely used in oral and maxillofacial surgery, for example to evaluate glandular tumors and metastatic lymph nodes. Eida et al. trained a model on ultrasound images of metastatic lymph nodes that raised the performance of residents to the level of experienced radiologists [42].

Risk of bias and concerns regarding applicability

We assessed the risk of bias and applicability of the included studies according to QUADAS-2. Overall, most of the included studies had a high risk of bias.

In terms of Patient Selection, some studies did not enroll random or consecutive samples of patients, such as Le 2021 [24], a case-control study. Data imbalance was considered strictly because predictions from models trained on imbalanced training sets are highly uncertain. A model also becomes more relevant to clinical decisions as the distribution of participants gets closer to that of the population [43]. Mackie 2022 [27] excluded patients with severe disease and a course of more than 10 years and was therefore rated as high risk.

In the Index Test domain, only six studies stated thresholds. Seven did not provide enough information about the models they used, and studies that merely listed the type and structure of the model could not be judged as low risk. More details, including hyperparameters, loss functions, and activation functions, must be provided. Researchers usually divided the sample into training, validation, and test sets for internal validation, but few studies conducted external validation, which is decisive for the final performance in a real clinical environment [43].
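The internal validation design described above can be sketched as a stratified split that preserves the class balance in each partition. This is an illustration only: the 70/15/15 ratios, the seed, and the function name are assumptions, not values reported by any included study.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.7, 0.15, 0.15), seed=0):
    """Return (train, validation, test) index lists with class proportions preserved."""
    if len(ratios) != 3 or abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must be three values that sum to 1")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical labels: 20 diseased (1) and 80 healthy (0) cases.
labels = [1] * 20 + [0] * 80
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

In imaging studies the split would normally be made per patient rather than per image, so that images from one patient never appear in both the training and test sets. External validation cannot be produced by any split of a single-center dataset; it requires data collected independently, for example at another center.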

In the Reference Standard domain, four studies did not state the reference standard they used, and three did not use a gold standard. Larheim and colleagues suggested that CBCT and CT are reliable for the diagnosis of DJD because they visualize the condyles more clearly [3, 44]. However, Kim 2020 [22] explored the diagnostic accuracy for DJD in panoramic images and took the annotations of two physicians on panoramic images as the gold standard. Meanwhile, some studies relied on the interpretation of a single specialist; cross-center validation and dental work experience have been suggested to reduce annotator bias [45].

Limitations of this review

There were some limitations in the included studies. First, the diagnostic accuracy of the models needs to be improved, especially sensitivity. Choi et al. reported that Trial 1 and Trial 2, which included images with indeterminate DJD, could not achieve satisfactory sensitivity or specificity, which means that the ability of the models to diagnose complex cases was insufficient [13]. Some researchers reported that the sensitivity of diagnosing DJD with panoramic images was approximately 0.5 lower than that with CT, and that bone destruction takes time to become visible in panoramic images. However, considering the convenience of the examination in clinics and the lack of TMD specialists, panoramic imaging is still recommended as a vital means of early diagnosis [27]. Second, the image modalities, models, and target conditions of the included studies varied. It is meaningless to pool diagnostic results blindly across different participants, conditions, and image modalities. Therefore, the synthesis results of this study were limited, and only a few models were evaluated in the diagnosis of DJD. Furthermore, a few studies did not provide the original data, such as TP, FP, FN, and TN counts, which prevented this review from exploring the diagnostic accuracy of more models. Third, most of the datasets of the included studies came from a single center. Data from multiple sources represent more diverse patients and images and thereby increase the robustness of models, but correctly interpreted data are mostly labeled by humans, which takes considerable time and resources. A public database should be established for researchers to share labeled data and train models.

In general, ML is promising for the accurate diagnosis of TMDs. In the future, more experiments should be carried out in clinical environments, and more attention should be paid to 3D reconstruction and the design of treatment plans. At the same time, in view of the ethical and moral constraints surrounding the right to privacy, more policies should be formulated to protect patients’ privacy and medical security.

Conclusions

The present systematic review and meta-analysis showed that some types of machine learning algorithms might be satisfactory in the diagnosis of DJD and that deep learning may be a promising tool. However, most studies had a high risk of bias in study design, and some datasets were too small, leading to unrepresentative results. Further prospective clinical studies are recommended; they should be soundly designed with respect to Patient Selection, Index Test, and Reference Standard and should explore more accurate and precise ML methods. We hope that some models will be developed for use in real clinical environments in the future.