Elsevier

Journal of Affective Disorders

Volume 318, 1 December 2022, Pages 364-379

Review article
Application of machine learning in predicting the risk of postpartum depression: A systematic review

https://doi.org/10.1016/j.jad.2022.08.070

Highlights

  • Review of machine learning techniques to predict postpartum depression risk.
  • Supervised learning was the main machine learning technique employed.
  • Studies that involved model translation had not been tested clinically.
  • All studies showed high bias risk and over half showed high application risk.

Abstract

Background

Postpartum depression (PPD) presents a serious health problem among women and their families. Machine learning (ML) is a rapidly advancing field with increasing utility in predicting PPD risk. We aimed to synthesize and evaluate the quality of studies on the application of ML techniques in predicting PPD risk.

Methods

We conducted a systematic search of eight databases to identify English- and Chinese-language studies that used ML techniques to predict PPD risk and reported performance metrics. The quality of the included studies was evaluated using the Prediction Model Risk of Bias Assessment Tool.

Results

Seventeen studies involving 62 prediction models were included. Supervised learning was the main ML technique employed, and the most common ML models were support vector machine, random forest, and logistic regression. Five studies (29 %) reported both internal and external validation. Two studies involved model translation, but neither was tested clinically. All studies showed a high risk of bias, and more than half showed high application risk.

Limitations

Including Chinese articles slightly reduced the reproducibility of the review. Model performance was not quantitatively analyzed owing to inconsistent metrics and the absence of methods for correlation meta-analysis.

Conclusions

Researchers have paid more attention to model development than to validation, and few have focused on improvement and innovation. Models for predicting PPD risk continue to emerge; however, few have achieved acceptable quality standards. Therefore, ML techniques for successfully predicting PPD risk are yet to be deployed in clinical environments.

Keywords

Postpartum depression
Risk prediction models
Machine learning
Supervised learning

1. Introduction

Postpartum depression (PPD) is a common mental health complication that generally begins within the first four weeks after delivery or later in the postpartum period. PPD is characterized by persistently low mood, loss of interest and pleasure, and decreased energy (ACOG, 2018; Zulauf Logoz, 2014). A meta-analysis of 291 studies conducted in 56 countries found a global pooled prevalence of PPD of 17.7 % (95 % CI: 16.6 %–18.8 %), with national prevalence ranging from 3 % to 38 % (Hahn-Holbrook et al., 2017). Evidence suggests that PPD places both mothers and their offspring at increased risk of serious complications, including self-harm (Paul et al., 2021), suicidal ideation (Mare et al., 2021; Trost et al., 2021), and infanticide (Barr and Beck, 2008). PPD can also result in adverse health outcomes among infants, including underweight and stunted growth (Farias-Antunez et al., 2018); delays in neuromotor, language, and general cognitive development (Aoyagi and Tsuchiya, 2019; Koutra et al., 2013); and an increased risk of autism spectrum disorder (Chen et al., 2021). Given the serious harm it causes to mothers and children, PPD has become a substantial public health concern that increases the burden of disease on women, families, and even entire countries. Therefore, interventions for early prevention and treatment should be implemented to reduce the prevalence of PPD.
Although PPD is receiving increasing attention, its pathophysiology is not understood well enough to support prevention based on etiology. Furthermore, PPD has been shown to be influenced by multiple factors, including demographic (e.g., age and education) (Doke et al., 2021; Matsumura et al., 2019), psychosocial (e.g., social support and intimate partner violence) (Desta et al., 2021), biological (e.g., thyroid function and serum lipids) (Minaldi et al., 2020; Shi et al., 2020; Sylven et al., 2013), and obstetric factors (e.g., preterm and cesarean births) (Baba et al., 2021; de Paula Eduardo et al., 2019). Additionally, its insidious onset and heterogeneous symptom presentation make early detection of PPD challenging. Therefore, many researchers have attempted to integrate the related risk factors into prediction models for PPD, aiming both to predict PPD risk and to evaluate the effectiveness of preventive approaches.
Machine learning (ML), a branch of artificial intelligence, is a general term for techniques that predict unknown phenomena by extracting hidden regularities and distributions from input data. ML techniques can be divided into supervised learning, unsupervised learning, and reinforcement learning (Shehab et al., 2022). Supervised learning uses labeled data to train models that predict the labels of new data; unsupervised learning discovers internal structure and meaningful patterns in unlabeled data; and reinforcement learning uses learning algorithms to make decisions based on the rules of a specific environment (Cho et al., 2022).
Over the past decades, ML techniques have been used in medicine in the form of logistic regression and Cox regression. Such models are easy to interpret because each feature is assigned a “weight.” However, researchers often over-interpret these weights as evidence of causality rather than mere association. Modern ML techniques shift the central goal in medicine from explanation toward flexible prediction, allowing the relationship between risk features and outcomes to be mediated by black-box algorithms (Bishara et al., 2022). In recent years, ML techniques have been used extensively to predict the risk of diseases such as cardiovascular disease (Wegner et al., 2022), delirium (Bishara et al., 2022), and diabetes (Zou et al., 2022).
To our knowledge, several studies on ML techniques for predicting PPD have been conducted, and a literature review summarizing their results was published recently (Cellini et al., 2022). However, that review was not comprehensive because the authors considered only studies relying on structured clinical data and ignored studies using social media text or vocalization data. Additionally, its authors did not evaluate the quality of the included studies, so the risk of bias and the clinical applicability of the existing models remain unclear, raising concerns about the feasibility of translating research findings into practice. To address these shortcomings, in this study we aimed to (1) comprehensively summarize PPD prediction studies that used ML techniques on multiple data types, (2) evaluate the quality of model development and validation using the Prediction Model Risk of Bias Assessment Tool (PROBAST), and (3) extract the PPD risk predictors reported most frequently in the included studies, thereby providing methodological support for subsequent studies and for the reasonable application of ML methods in predicting disease risk.

2. Method

This systematic review was conducted in accordance with the Joanna Briggs Institute guidelines for systematic reviews. To facilitate reproducible reporting, the results are presented according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Page et al., 2021).

2.1. Search strategy

We searched eight health science databases: PubMed, EMBASE, Web of Science, the Cochrane Library, China National Knowledge Infrastructure (CNKI), the Wanfang database, the China Science and Technology Journal Database (VIP), and the China Biology Medicine disc (CBM). The searches covered each database from inception to March 2022 and aimed to identify studies that used ML-based models to predict PPD risk. Additionally, we reviewed the reference list of each study to find further relevant studies. The search strategy combined Medical Subject Headings (MeSH) terms and free-text terms, as follows: (“Depression, Postpartum” OR “Postpartum Depression” OR “Post-partum Depression” OR “Postnatal depression” OR “Post-natal Depression”) AND (“Prediction Model” OR “Risk Model” OR “Risk Prediction Model” OR “Risk Stratification Model”) AND (“Machine Learning” OR “Logistic Regression” OR “Decision Trees” OR “Bayesian Network” OR “Naive Bayes” OR “K Nearest Neighbor” OR “Random Forest” OR “Support Vector Machine” OR “Artificial Neural Network” OR “AdaBoost” OR “XGBoost” OR “Gradient Boosting Machine” OR “Deep learning”).
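The Boolean structure of the search string (terms joined by OR within each concept, concepts joined by AND) can be sketched as follows. This is only an illustration of how the three term groups combine; the helper name `build_query` and the variable names are assumptions made here, and each database's actual field tags and syntax are not reproduced.

```python
# Hypothetical helper, not taken from the review: assembles a Boolean search
# string from groups of terms.
def build_query(groups):
    """OR the quoted terms within each group, then AND the groups together."""
    clauses = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# Term groups copied from the search strategy described above.
condition_terms = [
    "Depression, Postpartum", "Postpartum Depression", "Post-partum Depression",
    "Postnatal depression", "Post-natal Depression",
]
model_terms = [
    "Prediction Model", "Risk Model", "Risk Prediction Model",
    "Risk Stratification Model",
]
method_terms = [
    "Machine Learning", "Logistic Regression", "Decision Trees",
    "Bayesian Network", "Naive Bayes", "K Nearest Neighbor", "Random Forest",
    "Support Vector Machine", "Artificial Neural Network", "AdaBoost",
    "XGBoost", "Gradient Boosting Machine", "Deep learning",
]

query = build_query([condition_terms, model_terms, method_terms])
print(query)
```

Keeping the concepts as separate lists makes it easy to adapt the same strategy to a database that needs different quoting or field tags.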

2.2. Inclusion and exclusion criteria

We included studies that (1) aimed to identify patients who had symptoms of PPD or were at risk of PPD; (2) developed or validated models for predicting PPD using ML techniques that are available in the scikit-learn package or were declared by the authors to belong to the ML domain; (3) reported the performance of the proposed ML-based models; and (4) were original studies published in English or Chinese through a peer-reviewed process. We excluded studies that (1) lacked an available full text; (2) were conference abstracts, reviews, systematic reviews, or meta-analyses; or (3) involved animal experiments.

2.3. Study selection

The initial citations and records from the aforementioned databases were imported into EndNote 20, which was used to identify and remove duplicates. Two investigators then independently screened the titles and abstracts of the remaining articles against the eligibility criteria. Any discrepancies regarding study inclusion were resolved through discussion or by referral to a third investigator.

2.4. Data extraction

Data were extracted from all identified studies using the Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies (CHARMS) (Moons et al., 2014). The extracted variables were study, country, language, participants (data source, sample size, and eligibility criteria), outcome (classification methods and measurement time), ML techniques, final predictors, data processing (numeric and categorical variables and missing values), data splitting (training, validation, and testing sets), model optimization (feature selection and hyperparameter value selection), model validation status (internal and/or external), model translation, and model performance metrics. Two investigators extracted the data independently, and discrepancies were resolved through discussion, with the help of a third investigator if necessary.
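As a rough illustration of independent double extraction, the sketch below stores one reviewer's extraction as a record and lists the fields on which two reviewers disagree. The class `ExtractionRecord`, the function `discrepancies`, and the choice of fields (a subset of the variables listed above) are assumptions made for this example, not the instrument used in the review.

```python
from dataclasses import dataclass, fields, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ExtractionRecord:
    """One reviewer's extraction for one study (illustrative subset of variables)."""
    study: str
    country: str
    language: str
    data_source: str
    sample_size: int
    outcome_classification: str
    measurement_time: str
    ml_techniques: Tuple[str, ...]
    final_predictors: Tuple[str, ...]


def discrepancies(first: ExtractionRecord, second: ExtractionRecord) -> List[str]:
    """Return the names of the fields on which the two extractions differ."""
    return [f.name for f in fields(first)
            if getattr(first, f.name) != getattr(second, f.name)]


# Values taken from the Hochman et al. (2021) row of Table 1.
reviewer_a = ExtractionRecord(
    study="Hochman et al., 2021",
    country="Israel",
    language="English",
    data_source="EHR dataset",
    sample_size=214_359,
    outcome_classification="ICD-9/10 codes or new antidepressant prescription",
    measurement_time="1 year postpartum",
    ml_techniques=("XGBoost",),
    final_predictors=("Demographic", "Obstetric", "Biological"),
)
# Hypothetical second extraction that differs in one field, for illustration only.
reviewer_b = replace(reviewer_a, measurement_time="6 weeks postpartum")

to_discuss = discrepancies(reviewer_a, reviewer_b)
print(to_discuss)
```

Fields returned by `discrepancies` are the ones that would go to discussion or to the third investigator.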

2.5. Risk of bias and applicability assessment

PROBAST was used to assess the risk of bias of the prediction models presented in the studies (Moons et al., 2019). This tool comprises 20 signaling questions across four domains: (1) participants, (2) predictors, (3) outcome, and (4) analysis. Each domain is rated as “low,” “high,” or “unclear” risk of bias, and the four domain ratings determine the overall rating. The overall risk of bias was rated low if every domain was rated low, high if at least one domain was rated high, and unclear if at least one domain was rated unclear and all other domains were rated low. As in data extraction, two investigators assessed the quality of the included studies, and discrepancies were resolved through discussion or through referral to a third investigator.
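The rule for combining the four domain ratings into an overall rating can be written as a small function. This is a minimal sketch of the rule as stated in the paragraph above; the function name `overall_risk_of_bias` and the dictionary layout are assumptions, and the 20 signaling questions that produce each domain rating are not modeled.

```python
DOMAINS = ("participants", "predictors", "outcome", "analysis")
RATINGS = {"low", "high", "unclear"}


def overall_risk_of_bias(ratings):
    """Combine four PROBAST domain ratings into one overall rating.

    high    -> at least one domain is rated high
    unclear -> no domain is high, but at least one is unclear
    low     -> every domain is rated low
    """
    missing = [d for d in DOMAINS if d not in ratings]
    if missing:
        raise ValueError(f"missing domain ratings: {missing}")
    values = [ratings[d] for d in DOMAINS]
    invalid = [v for v in values if v not in RATINGS]
    if invalid:
        raise ValueError(f"invalid ratings: {invalid}")
    if "high" in values:
        return "high"
    if "unclear" in values:
        return "unclear"
    return "low"


# Hypothetical domain ratings for a single model, for illustration only.
example = {
    "participants": "low",
    "predictors": "unclear",
    "outcome": "low",
    "analysis": "high",
}
print(overall_risk_of_bias(example))
```

Because a single high-rated domain is enough for a high overall rating, the check for "high" comes first, before the check for "unclear".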

2.6. Data synthesis

The included studies were synthesized descriptively in the form of a narrative report. Quantitative synthesis, such as meta-analysis, could not be performed because the performance metrics reported across the studies were highly inconsistent.

3. Results

3.1. Summary of the studies

A total of 771 articles were identified through database searching (n = 764) and the tracing of additional references (n = 7). Of these, 43 duplicates were removed, 700 were excluded based on title and abstract, and 11 were excluded based on the availability of full text. Finally, 17 studies were included for bias and applicability risk assessment and data synthesis (Amit et al., 2021; Andersson et al., 2021; Fang, 2019; Fatima et al., 2019; Gabrieli et al., 2020; Hochman et al., 2021; Jimenez-Serrano et al., 2015; Natarajan et al., 2017; Park et al., 2021; Payne et al., 2020; Shatte et al., 2020; Shin et al., 2020; Tortajada et al., 2009; S. Wang et al., 2019; Xiao et al., 2020; Zhang et al., 2020; Zhang et al., 2021). The screening process is presented in Fig. 1.
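The screening counts reported above can be checked with a few lines of arithmetic. The sketch below only replays the numbers in the text; the function name `screening_flow` and the step labels are assumptions made for illustration.

```python
# Counts copied from the paragraph above.
identified = {"database searching": 764, "additional references": 7}
exclusions = [
    ("duplicates removed", 43),
    ("excluded on title and abstract", 700),
    ("excluded on availability of full text", 11),
]


def screening_flow(identified, exclusions):
    """Return (label, records remaining) after each screening step."""
    remaining = sum(identified.values())
    steps = [("identified", remaining)]
    for label, count in exclusions:
        if count > remaining:
            raise ValueError(f"cannot exclude {count} of {remaining} records")
        remaining -= count
        steps.append((label, remaining))
    return steps


flow = screening_flow(identified, exclusions)
for label, remaining in flow:
    print(f"{label}: {remaining}")
```

The final value is 17, which matches the number of included studies.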
Fig. 1. Article screening and selection from literature search to final sample.

3.2. General characteristics of the studies

3.2.1. Publication status

The general characteristics of the 17 studies are listed in Table 1. The studies were published between 2009 and 2021, with a marked increase in publications over the past three years (Fig. 2). The studies were distributed across 10 countries, with six conducted in the United States (35 %) and three in China (18 %) (Fig. 3). Fifteen articles were written in English and two in Chinese.

Table 1. Characteristics of the included studies (N = 17).

Fields per study: Country; language | Participants (data source; eligibility criteria) | Outcome (PPD) (classification; measurement time) | ML methods | Final predictors
Amit et al., 2021
  Country; language: UK; English
  Data source: An EHR dataset; n = 266,544
  Eligibility criteria: Women:
    • (1) 18–45 years old
    • (2) Gave the first live birth
    • (3) Had active medical records (at least one recorded diagnosis, one drug prescription, and one lab test during the pregnancy, the preceding 2 years, and the year following delivery)
    • (4) Non-ambiguous outcome
    • (5) Had no records of antidepressant prescriptions either before (within 1 year before the pregnancy) or after giving birth
  Outcome classification: One of the following:
    • (1) Diagnosis of depression
    • (2) New treatment with antidepressant drug
    • (3) Non-pharmacological treatment for depression
  Measurement time: 1 year postpartum
  ML methods: GBM
  Final predictors: …ic features (first label truncated); Clinical; Obstetric
Zhang et al., 2021
  Country; language: USA; English
  Data source: Two EHR datasets; n1 = 15,197, n2 = 53,972
  Eligibility criteria: Women:
    • (1) 18–45 years old
    • (2) With fully completed antenatal care procedures
    • (3) Had a live birth
  Outcome classification:
    • (1) Systematized Nomenclature of Medicine codes
    • (2) The use of antidepressants within 1 year following childbirth
  Measurement time: 1 year postpartum
  ML methods: LR; RF; DT; XGBoost; MLP
  Final predictors: Demographic; Psychosocial; Clinical; Obstetric
Hochman et al., 2021
  Country; language: Israel; English
  Data source: An EHR dataset; n = 214,359
  Eligibility criteria: Women:
    • (1) Without a diagnosis of depression or antidepressant prescription during the 6 months before gestation or during pregnancy
    • (2) Covered by continuous CHS insurance for at least one year before the pregnancy start date
    • (3) Singleton births
    • (4) No stillbirths and infant death within the first year following delivery
  Outcome classification:
    • (1) ICD-9/10 codes
    • (2) New prescription for at least 1 defined daily dose of antidepressant
  Measurement time: 1 year postpartum
  ML methods: XGBoost
  Final predictors: Demographic; Obstetric; Biological
Park et al., 2021
  Country; language: USA; English
  Data source: IBM MarketScan Medicaid Database; n = 573,634
  Eligibility criteria: Women:
    • (1) 12–55 years old
    • (2) Had a live birth
    • (3) More than 80 % registrations were completed during pregnancy plus 60 days post-delivery
    • (4) With mental health coverage
    • (5) White and Black
  Outcome classification: ICD-9/10 or AD prescription
  Measurement time: 60 days postpartum
  ML methods: LR; RF; XGBoost
  Final predictors: Demographic; Clinical; Obstetric; Health care utilization
Andersson et al., 2021
  Country; language: Sweden; English
  Data source: A prospective cohort study; n = 4313
  Eligibility criteria: Women:
    • (1) Aged 18 years or older
    • (2) Had sufficient ability to read and understand Swedish
    • (3) Did not have known bloodborne infections and/or non-viable pregnancy as diagnosed by routine ultrasound
  Outcome classification: EPDS ≥ 12
  Measurement time: 6 weeks postpartum
  ML methods: RR; LASSO; GBM; DRF; XRT; NB; SE
  Final predictors: Clinical; Psychosocial
Gabrieli et al., 2020
  Country; language: Singapore; English
  Data source: A cry vocalizations dataset used in a previous prospective cohort study; n = 56
  Eligibility criteria: Women:
    • (1) Had 5-month-old infants
    • (2) Were audio and video recorded for at least 50 min
  Outcome classification:
    • (1) BDI-II > 12
    • (2) SCID-I
  Measurement time: 5 months postpartum
  ML methods: C-ML
  Final predictors: Acoustic; Demographic
Shatte et al., 2020
  Country; language: Australia; English
  Data source: Reddit text posts; n = 365
  Eligibility criteria: Fathers who reported birth events on the forum Reddit
  Outcome classification: Postpartum changes in fathers' use of depression symptom terms
  Measurement time: 6 weeks postpartum
  ML methods: SVM
  Final predictors: Linguistic features (behavior, emotion, linguistic style, and discussion topics)
Zhang et al., 2020
  Country; language: China; English
  Data source: A prospective cohort study; n = 508
  Eligibility criteria: Women who were older than 18 years and at <13 gestation weeks
  Outcome classification: EPDS ≥ 9.5
  Measurement time: 6 weeks postpartum
  ML methods: E-RF; F-RF; E-SVM; F-SVM
  Final predictors: Demographic; Clinical; Psychosocial
Xiao et al., 2020
  Country; language: China; Chinese
  Data source: A prospective cohort study; n = 406
  Eligibility criteria: Women:
    • (1) Were in early trimester
    • (2) Had a single pregnancy
    • (3) With normal thyroid and serum lipids
    • (4) Understand and communicate normally
  Outcome classification: EPDS
  Measurement time: 6 weeks postpartum
  ML methods: RF
  Final predictors: Demographic; Psychosocial; Obstetric; Biological
Shin et al., 2020
  Country; language: USA; English
  Data source: The Pregnancy Risk Assessment Monitoring System data; n = 28,755
  Eligibility criteria: Women who gave a live birth
  Outcome classification: PHQ-2 (at least 1 positive answer)
  Measurement time: 1 year postpartum
  ML methods: RF; SVM; GBM; AdaBoost; NB; RPART; KNN; LR; NNET
  Final predictors: Demographic; Psychosocial; Clinical; Obstetric
Payne et al., 2020
  Country; language: USA; English
  Data source: Four prospectively collected cohorts; n1 = 51, n2 = 53, n3 = 113, n4 = 68
  Eligibility criteria: No information
  Outcome classification: EPDS ≥ 13
  Measurement time: 4–6 weeks postpartum
  ML methods: SVM
  Final predictors: Psychosocial; Biological
S. Wang et al., 2019
  Country; language: USA; English
  Data source: An EHR dataset; n = 9980
  Eligibility criteria: Women:
    • (1) With a fully completed antenatal care procedure at the hospital
    • (2) Had a singleton birth
  Outcome classification:
    • (1) The Statistics Canada
    • (2) ICD-9/10 codes
  Measurement time: 1 year postpartum
  ML methods: SVM; LR; RF; NB; XGBoost; DT
  Final predictors: Demographic; Psychosocial; Clinical
Fatima et al., 2019
  Country; language: Saudi Arabia; English
  Data source: Reddit text posts; n = 21
  Eligibility criteria: Women; posts that:
    • (1) Had a title and contained text in the body of the post
    • (2) Were not only based on images, videos, and links
  Outcome classification: Linguistic feature
  Measurement time: Not reported
  ML methods: LR; SVM; MLP
  Final predictors: Linguistic feature
Fang, 2019
  Country; language: China; Chinese
  Data source: Cross-sectional data; n = 2396
  Eligibility criteria: Women:
    • (1) With a fully completed antenatal care procedure at the hospital
    • (2) Had no history of mental disorders, stillbirths, infant death, or any major pressure events
  Outcome classification: EPDS ≥ 10
  Measurement time: 4–6 weeks postpartum
  ML methods: BN; NB; DT; RF; ANN; SVM; LR
  Final predictors: Demographic; Psychosocial; Obstetric
Natarajan et al., 2017
  Country; language: USA; English
  Data source: Facebook and Twitter survey data; n = 173
  Eligibility criteria: Women:
    • (1) 18 years or older
    • (2) Had a child less than one year old
  Outcome classification: PHQ-2 (at least 1 positive answer)
  Measurement time: 1 year postpartum
  ML methods: FGB; NB; DT; SVM; AdaBoost; Bagging; LR
  Final predictors: Demographic; Psychosocial; Pediatric
Jimenez-Serrano et al., 2015
  Country; language: Spain; English
  Data source: A prospective cohort study; n = 1397
  Eligibility criteria: Women:
    • (1) White
    • (2) Had no psychiatric treatment during pregnancy
    • (3) Had no infant death after delivery
    • (4) Understand and communicate normally
  Outcome classification: Probable cases (EPDS ≥ 9) were evaluated by clinical psychologists using the Spanish version of the Diagnostic Interview for Genetics Studies
  Measurement time: 1 week postpartum
  ML methods: NB; LR; SVM; ANN
  Final predictors: Demographic; Psychosocial; Obstetric
Tortajada et al., 2009
  Country; language: Spain; English
  Data source: A prospective cohort study; n = 1397
  Eligibility criteria: Women:
    • (1) White
    • (2) Had no psychiatric treatment during pregnancy
    • (3) Had no infant death after delivery
    • (4) Understand and communicate normally
  Outcome classification: Probable cases (EPDS ≥ 9) were evaluated by clinical psychologists using the Spanish version of the Diagnostic Interview for Genetics Studies
  Measurement time: 32 weeks postpartum
  ML methods: ANN
  Final predictors: Demographic; Psychosocial; Obstetric; Biological
Notes: AD: antidepressant; ANN: artificial neural network; AdaBoost: adaptive boosting; BDI-II: Beck Depression Inventory-Second Edition; BN: Bayesian network; C-ML: cloud-based machine learning; DT: decision trees; DRF: distributed random forest; EHR: electronic health records; EPDS: Edinburgh Postnatal Depression Scale; E-RF: model built using random forest algorithm and expert consultation; E-SVM: model built using support vector machine algorithm and expert consultation; F-RF: model built using random forest algorithm and random forest-based filter feature selection method; F-SVM: model built using support vector machine algorithm and random forest-based filter feature selection method; FGB: functional gradient boosting; GBM: gradient boosting machine; ICD: International Classification of Diseases; KNN: K-nearest neighbor; LR: logistic regression; LASSO: least absolute shrinkage and selection operator; ML: machine learning; MLP: multilayer perceptron; NB: naïve Bayes; NNET: neural network; PPD: post-partum depression; PHQ-2: Patient Health Questionnaire 2; RF: random forest; RPART: recursive partitioning and regression trees; RR: ridge regression; SCID-I: Structured Clinical Interview for DSM-IV Axis I Disorders; SVM: support vector machine; SE: Stacked Ensemble; XGBoost: Extreme Gradient Boosting; XRT: extreme randomized forests.
Fig. 2. Line chart of the publication years of the included studies.

Fig. 3. Country distribution of the included studies.

3.2.2. Data source and sample size

Across the studies, there were three types of data source: electronic health records (EHRs) or medical system databases (6/17, 35 %), clinical research data (9/17, 53 %), and social media data (2/17, 12 %). The total sample sizes for these three data sources ranged from 9980 to 69,169, from 56 to 4313, and from 21 to 365, respectively. Notably, Zhang et al. (2021) and S. Wang et al. (2019) used the same data source, as did Jimenez-Serrano et al. (2015) and Tortajada et al. (2009). Sixteen studies had females as the target population, and one had males. The eligibility criteria for participants were heterogeneous in terms of age, race, psychiatric and adverse obstetric history, medical diagnosis, and parity. All studies clearly defined their study populations and data sources; however, none justified their sample size.

3.2.3. Outcome measurement

Among the 17 studies, the outcome variable for predicting PPD risk was identified mainly in four ways: (1) a specific diagnosis, ICD-9/10 codes, or antidepressant prescriptions for PPD in the medical records (5/17, 29 %); (2) PPD screening scales, such as the Edinburgh Postnatal Depression Scale (EPDS, 5/17, 29 %), the Beck Depression Inventory-II (BDI-II, 1/17, 6 %), and the Patient Health Questionnaire-2 (PHQ-2, 2/17, 12 %); (3) further assessment through psychiatrists' clinical interviews of probable cases screened using the aforementioned scales (2/17, 12 %); and (4) other informal methods, such as the linguistic features of textual posts (2/17, 12 %). Six studies (35 %) measured PPD risk at one year postpartum, six (35 %) at six weeks postpartum, one (6 %) at one week postpartum, one (6 %) at 32 weeks postpartum, and the remaining three (18 %) at 60 days postpartum, at five months postpartum, and at an unspecified time during or after the postpartum period, respectively.

3.2.4. ML-based prediction model

In total, the 17 studies used 21 types of ML techniques to establish 62 ML-based models for predicting PPD risk. Five studies (29 %) used only one ML technique, whereas the remaining 12 studies (71 %) used three to nine ML techniques. As shown in Fig. 4, the most commonly used ML technique was the support vector machine (SVM, frequency [Fre] = 10), followed by random forest (RF, Fre = 8) and logistic regression (LR, Fre = 8). Other ML techniques included naïve Bayes (NB, Fre = 6), artificial neural networks (ANNs, Fre = 5), decision trees (DT, Fre = 4), and eXtreme Gradient Boosting (XGBoost, Fre = 4), among others.
Fig. 4. The frequency of ML techniques used in predicting PPD risk across included studies.

Notes: SVM: support vector machine. RF: random forest. LR: logistic regression. NB: naïve Bayes. ANN: artificial neural network. DT: decision trees. XGBoost: eXtreme Gradient Boosting. GBM: gradient boosting machine. AdaBoost: adaptive boosting. FGB: functional gradient boosting. RR: ridge regression. DRF: distributed random forest. XRT: extreme randomized forests. SE: stacked ensemble. LASSO: least absolute shrinkage and selection operator. BN: Bayesian network. KNN: K-nearest neighbor. RPART: recursive partitioning and regression trees. NNET: neural network. C-ML: cloud-based machine learning.

3.2.5. Predictors

As shown in Table 1, most of the studies included demographic, psychosocial, clinical and obstetric predictors. Four studies (24 %) considered biological predictors, two studies (12 %) focused on the linguistic features of patients' social media posts for prediction purposes, and one study (6 %) used acoustic features obtained from infants' cry vocalizations. We summarized the 18 predictors that were included at least three times across all the studies (Table 2), and found six predictors in the demographic domain (maternal age, ethnicity, education, income, marital status, and smoking), four in the psychosocial domain (depression during pregnancy, pre-pregnancy depression history/diagnosis, anxiety during pregnancy, and stress/stressful life events), four in the obstetric domain (newborn gender, delivery mode, parity, and gestational age at delivery), and four in the clinical domain (pre-pregnancy body mass index [BMI], antidepressant use, thyroid dysfunction, and sleep status).

Table 2. Summary of risk predictors for PPD that were included at least three times in all studies.

Study columns: Amit et al., 2021; Zhang et al., 2021; Hochman et al., 2021; Park et al., 2021; Andersson et al., 2021; Gabrieli et al., 2020; Shatte et al., 2020; Zhang et al., 2020; Xiao et al., 2020; Shin et al., 2020; Payne et al., 2020; S. Wang et al., 2019; Fatima et al., 2019; Fang, 2019; Natarajan et al., 2017; Jimenez-Serrano et al., 2015; Tortajada et al., 2009

Predictors (frequency ≥ 3) | Total

Demographic factors
Maternal age | 11
Ethnicity/race | 7
Maternal education | 6
Income | 5
Marital status | 4
Smoking | 3

Psychosocial factors
Depression during pregnancy | 10
Pre-pregnancy depression history/diagnosis | 6
Anxiety during pregnancy | 5
Stress/stressful life events | 5

Obstetric factors
Gender/gender expectation of newborn | 6
Mode of delivery | 4
Parity/primigravida | 4
Gestational age at delivery | 3

Clinical factors
Pre-pregnancy BMI | 4
Antidepressant use | 4
Thyroid dysfunction | 3
Sleep status | 3

3.3. Model development and validation process

3.3.1. Study types

The processes for model development and the validation methods used in the studies are listed in Table 3. We grouped the studies into three categories based on how they contributed to ML-based models for predicting PPD risk: (1) studies that developed and validated ML-based models, (2) studies that improved existing ML algorithms to construct models, and (3) studies that proposed innovative ML approaches to construct models. Among the 17 studies, 15 (88 %) were in the first category, one (6 %) was in the second category, and one (6 %) was in the third category.

Table 3. The process of model development and validation across all studies.

Columns: Study | Data processing (numeric variables, categorical variables, missing values) | Data splitting (training set, validation set, testing set) | Feature selection | Hyperparameter value selection | Model validation (internal, external) | Model translation

Type 1

Amit et al., 2021
  • Data processing: One-hot encoding; Mean imputation
  • Data splitting: England data: 177,833; 1/00–4/10 data: 178,017; Random train set: 173,723 (3-fold CV); Non-Eng. data: 82,752; 5/10–12/17 data: 82,568; Random test set: 86,862; Holdout set: 5959
  • Feature selection: —; 69
  • Internal validation: Random split validation (pooled 3-fold CV)
  • External validation: Geographical validation; Temporal validation; Holdout test

Zhang et al., 2021
  • Data processing: Normalization; Dummy encoding; Mean imputation
  • Data splitting: WCM data (80 %): 12,158 (5-fold CV); WCM data (20 %): 3039; CDRN set: 53,972
  • Feature selection: SFS; 32
  • Hyperparameter value selection: Grid search
  • Internal validation: Random split validation (pooled 5-fold CV)
  • External validation: Geographical validation

Hochman et al., 2021
  • Data processing: XGBoost automatic handling
  • Data splitting: 2008–2014 data (80 %): 148,023; 2008–2014 data (20 %): 37,006; 2015 data: 29,330
  • Feature selection: SHAP; 156
  • Internal validation: Random split validation
  • External validation: Temporal validation

Andersson et al., 2021
  • Data processing: Normalization; Binary encoding; MICE
  • Data splitting: BP dataset: 4277; Combined dataset: 2385
  • Feature selection: MDI; 29

Gabrieli et al., 2020
  • Data processing: NA; NA; Data augmentation
  • Data splitting: 80 % data: 1131; 10 % data: 141; 10 % data: 141
  • Feature selection: Praat software for voice analysis; 5
  • Internal validation: Random split validation

Shatte et al., 2020
  • Data processing: NA; NA
  • Data splitting: 365 (10-fold CV)
  • Feature selection: Linear support vector classification; 59
  • Internal validation: 10-fold CV

Zhang et al., 2020
  • Data splitting: 75 % data: 381 (5-fold CV); 25 % data: 127
  • Feature selection: (1) Expert consultation; 17. (2) RF-based FFS; 7
  • Internal validation: Random split validation (pooled 5-fold CV)

Xiao et al., 2020
  • Data splitting: 66.5 % data: 270; 33.5 % data: 136
  • Feature selection: UA; 49
  • Internal validation: Bootstrap

Shin et al., 2020
  • Data splitting: Set1: 13,356 (10-fold CV); Set2: 13,356 (10-fold CV); Set3: 13,356 (10-fold CV)
  • Feature selection: Relief; Set1: 99; Set2: 86; Set3: 95; 47 common features
  • Internal validation: Random split validation

Payne et al., 2020
  • Data splitting: Cohort 1 data: 51; Cohort 3 data: 113; Cohort 4 data: 68
  • Feature selection: —; 3
  • Internal validation: LOOCV
  • External validation: Geographical validation

S. Wang et al., 2019
  • Data splitting: 9980 (10-fold CV)
  • Feature selection: UA; 98
  • Internal validation: 10-fold CV

Fatima et al., 2019
  • Data processing: NA; NA
  • Data splitting: 67.33 % data: 2172 (10-fold CV); 32.67 % data: 1049
  • Feature selection: LASSO; 20

Fang, 2019
  • Data processing: Normalization; One-hot encoding; Mean or mode imputation
  • Data splitting: 2396 (10-fold CV)
  • Feature selection: UA; 21
  • Internal validation: 10-fold CV

Jimenez-Serrano et al., 2015
  • Data processing: Normalization; Dummy encoding; Mode imputation
  • Data splitting: 72 % data: 1006; 8 % data: 112; 20 % data: 279
  • Feature selection: —; 11
  • Hyperparameter value selection: Grid search
  • Internal validation: Random split validation
  • Model translation: Android application

Tortajada et al., 2009
  • Data processing: Normalization; Dummy encoding; Mode imputation
  • Data splitting: 72 % data: 1006; 8 % data: 112; 20 % data: 279
  • Feature selection: Pruning; 16
  • Internal validation: Random split validation

Type 2

Park et al., 2021
  • Data splitting: 2014–2017 data (5:3:2); 2018 data: deployment set
  • Feature selection: —; 71
  • Internal validation: Random split validation
  • External validation: Temporal validation

Type 3

Natarajan et al., 2017
  • Data splitting: 173 (5-fold CV)
  • Feature selection: —; 11
  • Internal validation: 5-fold CV
  • Model translation: Visualize tree
Notes: “—” means no information; BP: background & pregnancy; CDRN: New York City Clinical Data Research Network data; CV: cross-validation; GBT: gradient boosting trees; LASSO: least absolute shrinkage and selection operator; MICE: multivariate imputation by chained equations; MDI: mean decrease of impurity; NA: not applicable; RF-based FFS: random forest-based filter feature selection; SFS: sequential forward selection; SHAP: Shapley additive explanations; UA: univariate analysis; WCM: Weill Cornell Medicine.

3.3.2. Data pre-processing

Most of the studies (12/17, 71 %) did not fully describe how the training data were preprocessed. Approximately 29 % (5/17) of the studies normalized the numeric variables. Six studies described how categorical variables were processed: 50 % (3/6) used one-hot encoding, 33 % (2/6) used dummy coding, and 17 % (1/6) used binary coding (type unspecified). Missing data were addressed using constant value imputation (5/17, 29 %), multiple imputation (1/17, 6 %), XGBoost automatic handling (1/17, 6 %), or data augmentation (1/17, 6 %). The remaining studies (9/17, 53 %) did not report missing data and were therefore presumed to have used complete cases.
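
The preprocessing steps named above can be illustrated with a minimal sketch. It is not taken from any of the reviewed studies: the function names and the toy values (maternal age, delivery mode) are hypothetical, and only the Python standard library is used. It shows min-max normalization, mean imputation as one form of constant value imputation, and the difference between one-hot encoding (k columns for k categories) and dummy encoding (k - 1 columns, with one reference level).

```python
from statistics import mean


def min_max_normalize(values):
    """Rescale numeric values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mean_impute(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def one_hot(values, categories):
    """One indicator column per category (k columns for k categories)."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def dummy_encode(values, categories):
    """Treat the first category as the reference level (k - 1 columns)."""
    return [[1 if v == c else 0 for c in categories[1:]] for v in values]


# Hypothetical toy data: maternal age with one missing value, and delivery mode.
ages = [24, None, 36, 30]
modes = ["vaginal", "caesarean", "vaginal", "assisted"]
mode_levels = ["vaginal", "caesarean", "assisted"]

ages_filled = mean_impute(ages)               # the missing age becomes 30
ages_scaled = min_max_normalize(ages_filled)  # [0.0, 0.5, 1.0, 0.5]
modes_one_hot = one_hot(modes, mode_levels)   # three columns per record
modes_dummy = dummy_encode(modes, mode_levels)  # two columns per record
```

In a real pipeline, the imputation value and the normalization bounds would normally be estimated on the training set only and then reused for the validation and test sets, so that no information leaks from held-out data.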

3.3.3. Data splitting

To avoid an optimistic estimate of performance, it is imperative to split the datasets into training, validation, and test sets. All studies reported how the data were split: eight studies (47 %) split the data into the three sets mentioned above, eight studies (47 %) split the data into training and validation sets, and one study (6 %) had a training set only. The sample sizes of the training sets ranged from 51 to 178,017, those of the validation sets from 113 to 86,862, and those of the test sets from 68 to 53,972. Across the included studies, the data were split in the following ways: (1) in a fixed proportion or ratio, such as 4:1; (2) based on location or time attributes, such as geographical and temporal splits; (3) using resampling techniques, such as bootstrapping and cross-validation; and (4) by source, using data obtained from different studies or institutes.
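
Two of the splitting strategies listed above, a fixed-ratio random split and a temporal split, can be sketched as follows. This is an illustrative example under stated assumptions rather than the procedure of any reviewed study: the record structure, the `year` field, the cutoff year, and the 80/20 ratio are all hypothetical choices.

```python
import random


def random_split(records, train_frac=0.8, seed=0):
    """Shuffle once with a fixed seed, then cut into training and held-out parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def temporal_split(records, cutoff_year):
    """Develop on records up to the cutoff year; hold out all later records."""
    development = [r for r in records if r["year"] <= cutoff_year]
    later = [r for r in records if r["year"] > cutoff_year]
    return development, later


# Hypothetical records: 100 deliveries spread over the years 2008-2015.
records = [{"id": i, "year": 2008 + i % 8} for i in range(100)]

# Later years are set aside first, for temporal (external) validation.
development, temporal_holdout = temporal_split(records, cutoff_year=2014)

# The development period is then split 4:1 for training and internal testing.
train, internal_test = random_split(development, train_frac=0.8)
```

Setting aside the later period before any random split keeps the temporal hold-out untouched during model development, which is what allows it to serve as a form of external validation.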

3.3.4. Model optimization

Among the 17 studies, 12 (71 %) reported an approach to feature selection. These approaches included sequential forward selection (SFS), univariate analysis, model-based measures of variable importance (such as Shapley Additive Explanations [SHAP] and Mean Decrease in Impurity [MDI]), Praat software for speech analysis, linear support vector classification, random forest-based filter feature selection, the Relief algorithm, least absolute shrinkage and selection operator (LASSO) regression, pruning algorithms, and non-statistical approaches such as literature review and expert consultation. The number of features selected in each study ranged from three to 156. Concerning hyperparameter selection, only two studies reported using grid search to optimize the associated parameters.
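
To make one of these approaches concrete, sequential forward selection can be sketched as a greedy loop that repeatedly adds the single feature that most improves a score. The sketch below is generic and is not the implementation used in any reviewed study. The scoring function `toy_score` and its `GAIN` values are invented stand-ins for what would, in practice, be a cross-validated performance estimate such as AUROC.

```python
def sequential_forward_selection(candidates, score_fn, max_features):
    """Greedily add the feature that most improves score_fn; stop when nothing helps."""
    selected = []
    best_score = score_fn(selected)
    while len(selected) < max_features:
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        # Score every one-feature extension of the current subset.
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored)
        if top_score <= best_score:
            break  # no candidate improves the score
        selected.append(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical per-feature gains standing in for a cross-validated AUROC.
GAIN = {"antenatal_depression": 0.20, "anxiety": 0.10, "maternal_age": 0.03, "parity": 0.0}


def toy_score(features):
    score = 0.5 + sum(GAIN[f] for f in features)
    # Assumed redundancy: anxiety adds less once antenatal depression is included.
    if "antenatal_depression" in features and "anxiety" in features:
        score -= 0.08
    return round(score, 3)


selected, score = sequential_forward_selection(list(GAIN), toy_score, max_features=3)
```

Because the search is greedy, the selected subset depends on the order in which features become useful and is not guaranteed to be the best subset overall.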

3.3.5. Model validation

The internal validation of the models was generally conducted through data-splitting and resampling methods, including random split validation, K-Fold cross validation, and bootstrap. External validation refers to the evaluation of a prediction model using data that have not been used in model development, including geographical validation, temporal validation, and domain validation (J. Wang et al., 2019). Of the 17 studies, five (29 %) involved both internal and external validation, and 12 (71 %) lacked external validation. In the studies focusing on model development and internal validation, six used random split validation only, four used random split validation (pooled K-Fold cross validation), four used K-Fold cross validation only (K = 3, 5, 10), two used the bootstrapping approach, one used leave-one-out cross validation (LOOCV), and one used an unknown approach. In the studies involving the external validation of the associated models, one used geographical, temporal, and holdout set validation, two used geographical validation only, and two used temporal validation only.
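
The resampling schemes used for internal validation differ mainly in how sample indices are assigned to training and evaluation. The following sketch generates the index sets for K-fold cross-validation, LOOCV, and a bootstrap resample. The function names are hypothetical. The out-of-bag evaluation shown for the bootstrap is one common variant and is an assumption here, since the reviewed studies did not describe their bootstrap procedure in this section.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def loocv_indices(n):
    """Leave-one-out is K-fold with K = n, so no shuffling is needed."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def bootstrap_indices(n, seed=0):
    """Draw n indices with replacement; samples never drawn are 'out-of-bag'."""
    rng = random.Random(seed)
    drawn = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(drawn))
    return drawn, out_of_bag


folds = list(k_fold_indices(n=10, k=5))
loo = list(loocv_indices(4))
boot_train, boot_test = bootstrap_indices(n=10)
```

All three schemes reuse the development data, which is why they count as internal validation; geographical or temporal validation requires data that played no part in model development.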
Additionally, only two studies attempted to translate the proposed ML-based models into clinical tools: one as a visualized tree learned by functional gradient boosting (FGB) and the other as a mobile application. Both studies indicated an intent to test the translated models in clinical environments in the future.

3.4. Model performance

The performance of the best ML-based model in each study, grouped by outcome measurement time, is listed in Table 4. The performance metrics reported in the 17 studies comprised the area under the receiver operating characteristic curve (AUROC; 15/17, 88 %), Brier score (1/17, 6 %), sensitivity or recall (16/17, 94 %), specificity (11/17, 65 %), accuracy (10/17, 59 %), precision (7/17, 41 %), positive predictive value (PPV; 5/17, 29 %), negative predictive value (NPV; 5/17, 29 %), and F-score (3/17, 18 %). The ML techniques used to predict the risk of PPD at six weeks postpartum included XRT, SVM, and RF, among which SVM had the highest frequency (Fre = 3) and performance (AUROC = 0.84). The ML techniques used to predict the risk of PPD at one year postpartum included GBM, LR, XGBoost, RF, SVM, and FGB, among which gradient boosting-based algorithms had the highest frequency (Fre = 3) and FGB demonstrated the highest performance (AUROC = 0.952).
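
Most of the threshold-based metrics listed above are derived from the same four confusion-matrix counts. The sketch below makes the relationships explicit. The counts are hypothetical (1000 women, 10 % of whom develop PPD) and do not come from any reviewed study. Note that recall is the same quantity as sensitivity and precision is the same quantity as PPV.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes are present and
    at least one case is predicted positive and one negative).
    """
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # also called precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)  # F1: harmonic mean
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f_score": f_score,
    }


# Hypothetical screening result: 100 true cases and 900 non-cases.
metrics = confusion_metrics(tp=80, fp=180, tn=720, fn=20)
```

With these illustrative counts, sensitivity, specificity, and accuracy are all 0.80, yet the PPV is only about 0.31 because non-cases greatly outnumber cases. This is one reason why a single metric gives an incomplete picture of a risk model for a relatively uncommon outcome.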

Table 4. Predictive performance of the best model in each study, sorted by outcome measurement time.

Columns: Outcome measurement time / Study | Validation status | Best-performing ML model | Reported values in column order (AUROC, Brier score, SEN/REC, SPE, ACC, PRE, PPV, NPV, F-score), empty cells omitted

1-week postpartum
Jimenez-Serrano et al., 2015 | I | NB | 0.75, 0.72, 0.73, 0.73

6-week postpartum
Andersson et al., 2021 | I | XRT | 0.81, 0.72, 0.75, 0.73, 0.33, 0.94
Shatte et al., 2020 | I | SVM | 0.68, 0.66, 0.67, 0.67
Zhang et al., 2020 | I | F-SVM | 0.78, 0.69, 0.83, 0.68, 0.84
Xiao et al., 2020 | I | RF | 0.833, 0.614, 0.891, 0.801, 0.73, 0.828
Payne et al., 2020 | I; E | SVM | 0.84
Fang, 2019 | I | BN | 0.763, 0.717, 0.799, 0.717, 0.714, 0.714

60-day postpartum
Park et al., 2021 | I; E | XGBoost | 0.722, 0.615, 0.722, 0.783

32-week postpartum
Tortajada et al., 2009 | I | ANN | 0.84, 0.78, 0.85, 0.84

5-month postpartum
Gabrieli et al., 2020 | I | C-ML | 0.969, 0.888, 0.895, 0.904

1-year postpartum
Amit et al., 2021 | I; E | GBM | 0.844, 0.764, 0.80
Zhang et al., 2021 | I; E | LR | 0.886, 0.158, 0.80, 0.84, 0.26, 0.98
Hochman et al., 2021 | I; E | XGBoost | 0.712, 0.349, 0.905, 0.014, 0.985, 0.776
Shin et al., 2020 | I | RF | 0.884, 0.732, 0.865, 0.791, 0.839
S. Wang et al., 2019 | I | SVM | 0.79, 0.894, 0.58
Natarajan et al., 2017 | I | FGB | 0.952, 0.84, 0.92

Unknown
Fatima et al., 2019 | I | MLP | 0.868, 0.869, 0.87
Notes: “—” means no information; ACC: accuracy; ANN: artificial neural network; AUROC: the area under the receiver operating characteristic curve; BN: Bayesian network; C-ML: cloud-based machine learning model; E: external validation; F-SVM: model built using support vector machine algorithm and random forest-based filter feature selection method; FGB: functional gradient boosting; GBM: gradient boosting machine; I: internal validation; LR: logistic regression; ML: machine learning; MLP: multilayer perceptron; NB: naïve Bayes; NPV: negative predictive value; PRE: precision; PPV: positive predictive value; RF: random forest; SEN: sensitivity; REC: recall; SPE: specificity; SVM: support vector machine; XGBoost: Extreme Gradient Boosting; XRT: extreme randomized forests.
Discrimination and calibration are two crucial aspects to consider when evaluating the prediction performance of ML-based models. The AUROC can be regarded as a discrimination index for models predicting dichotomous outcomes. For PPD at six weeks postpartum, the median AUROC in five studies reporting internal validation was 0.78 (range: 0.763–0.833), and the single study reporting external validation achieved an AUROC of 0.84 (SVM). For PPD at one year postpartum, the median AUROC was 0.884 (range: 0.79–0.952) in three studies reporting internal validation and 0.844 (range: 0.712–0.89) in three studies reporting external validation. The Brier score is a calibration index for ML-based prediction models; it was reported only by Zhang et al. (2021), whose LR-based model for predicting the risk of PPD within one year postpartum scored 0.16 (reference value: 0–0.25).
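
Both indices can be computed directly from predicted probabilities and observed outcomes. The sketch below is a minimal illustration with invented labels and probabilities, not a reproduction of any reviewed result. AUROC is computed through its rank interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen non-case. The Brier score is the mean squared difference between the predicted probability and the observed outcome.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(labels, probabilities):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(labels, probabilities)) / len(labels)


# Hypothetical outcomes (1 = PPD) and predicted probabilities.
labels = [1, 1, 0, 0, 0]
probs = [0.9, 0.4, 0.5, 0.2, 0.1]

auroc_value = auroc(labels, probs)        # 5 of 6 case/non-case pairs ranked correctly
brier_value = brier_score(labels, probs)  # 0.67 / 5 = 0.134

# A non-informative model that predicts 0.5 for everyone scores exactly 0.25.
brier_uninformative = brier_score(labels, [0.5] * len(labels))
```

The last line shows where the upper reference value of 0.25 comes from: it is the Brier score of a model that assigns every individual a probability of 0.5, so useful models should score below it, and lower is better.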
The AUROC scores for the 21 types of ML-based models proposed in the studies are summarized in Table 5. Across all the studies, there were 62 ML-based models, with AUROC scores ranging from 0.565 to 0.969. By technique, the SVM-based models achieved AUROC scores of 0.651–0.864, the LR-based models 0.707–0.886, and the RF-based models 0.70–0.884. A total of 20 ML-based models had AUROC scores of >0.8. Among these, SVM-based (3/20, 15 %) and RF-based (3/20, 15 %) models were the most numerous, followed by ANN-based (2/20, 10 %), DT-based (2/20, 10 %), and GBM-based models (2/20, 10 %).

Table 5. The frequency, AUROC range, and number of models with AUROC > 0.8 for each type of ML technique.

ML techniques | Frequency | Range of AUROC | AUROC > 0.8
SVM | 10 | 0.651–0.864 | 3
LR | 8 | 0.707–0.886 | 1
RF | 8 | 0.70–0.884 | 3
NB | 6 | 0.684–0.793 | 0
ANN/MLP | 5 | 0.66–0.887 | 2
DT | 4 | 0.66–0.902 | 2
GBM | 3 | 0.73–0.859 | 2
XGBoost | 4 | 0.712–0.771
AdaBoost | 2 | 0.784–0.857 | 1
FGB | 1 | 0.952 | 1
Bagging | 1 | 0.565 | 0
RR | 1 | 0.79 | 0
DRF | 1 | 0.80 | 1
XRT | 1 | 0.81 | 1
SE | 1 | 0.80 | 1
LASSO | 1 | 0.80 | 1
BN | 1 | 0.763 | 0
KNN | 1 | 0.776 | 0
RPART | 1 | 0.789 | 0
NNET | 1 | 0.704 | 0
C-ML | 1 | 0.969 | 1
Notes: AUROC: the area under the receiver operating characteristic; ANN: artificial neural network; BN: Bayesian network; C-ML: cloud-based machine learning; DT: decision trees; DRF: distributed random forest; FGB: functional gradient boosting; GBM: gradient boosting machine; KNN: K-nearest neighbor; LR: logistic regression; LASSO: least absolute shrinkage and selection operator; ML: machine learning; MLP: multilayer perceptron; NB: naïve Bayes; NNET: neural network; RF: random forest; RPART: recursive partitioning and regression trees; RR: ridge regression; SVM: support vector machine; SE: Stacked Ensemble; XGBoost: Extreme Gradient Boosting; XRT: extreme randomized forests.

3.5. Risk of bias and applicability risk

The results of the risk of bias and applicability risk assessments are listed in Table 6. As shown in Fig. 5, all the studies demonstrated a high risk of bias, which most often originated in the “participants” and “analysis” domains. Specifically, among the 17 studies, eight demonstrated a high or unclear risk of bias in the participant domain, five in the predictor domain, and four in the outcome domain, and all 17 had a high risk of bias in the analysis domain. Issues in the participant domain comprised the use of non-cohort or non-nested case-control data and the exclusion of subgroups of potential predictors. Issues in the predictor domain were related to unclear definitions or assessments made with prior knowledge of the outcome. Issues in the outcome domain were associated with unclear or inappropriate determinations. Issues in the analysis domain involved flawed handling of missing data, unclear variable transformation, predictor selection based on univariable analysis, and inadequate reporting.

Table 6. Results of bias and applicability risk assessment according to PROBAST.

Study | Risk of bias: Participants | Risk of bias: Predictors | Risk of bias: Outcome | Risk of bias: Analysis | Applicability: Participants | Applicability: Predictors | Applicability: Outcome | Overall ROB | Overall ROA
Amit et al., 2021 | High | Low | Low | High | Low | Low | Low | High | Low
Zhang et al., 2021 | Low | Low | Low | High | Low | Low | Low | High | Low
Hochman et al., 2021 | High | Low | Low | High | High | Low | Low | High | High
Park et al., 2021 | Low | Low | Low | High | Low | Low | Low | High | Low
Andersson et al., 2021 | Low | Low | Low | High | Low | Low | Low | High | Low
Gabrieli et al., 2020 | Low | Low | Low | High | Low | High | Low | High | High
Shatte et al., 2020 | Low | Low | High | High | Low | High | High | High | High
Zhang et al., 2020 | Low | Low | Low | High | Low | Low | Low | High | Low
Xiao et al., 2020 | High | Low | Low | High | High | Low | Low | High | High
Shin et al., 2020 | Low | Unclear | High | High | Low | Low | Low | High | Low
Payne et al., 2020 | Unclear | Unclear | Low | High | High | High | Low | High | High
S. Wang et al., 2019 | Low | Unclear | Low | High | Low | Low | Low | High | Low
Fatima et al., 2019 | High | High | High | High | Low | High | High | High | High
Fang, 2019 | High | Low | Low | High | High | Low | Low | High | High
Natarajan et al., 2017 | Low | Unclear | High | High | Low | Low | High | High | High
Jimenez-Serrano et al., 2015 | High | Low | Low | High | High | Low | Low | High | Low
Tortajada et al., 2009 | High | Low | Low | High | High | Low | Low | High | High
Notes: Low: Low risk of bias (no signaling questions in the domain answered as indicating risk of bias); High: High risk of bias (1 or more signaling questions in the domain answered as indicating risk of bias); Unclear: Unclear risk of bias (1 or more signaling questions in the domain answered as unclear); ROB: risk of bias; ROA: risk of applicability.
Fig. 5. The results of bias risk assessment across included studies.

As shown in Table 6 and Fig. 6, more than half of the studies (9/17, 53 %) raised high concerns regarding applicability. Such concerns most often originated from the “participants” and “predictors” domains. Specifically, six studies raised concerns in the “participants” domain, four in the “predictors” domain, and three in the “outcome” domain; these concerns were related to narrow inclusion criteria and unclear definitions of the predictors or outcomes.
Fig. 6. The results of applicability risk assessment across included studies.

4. Discussion

4.1. Summary of findings

We identified 62 ML-based models across 17 studies for predicting PPD risk. We systematically reviewed the development and performance of these models and then assessed the bias and applicability risks of the studies. Research interest in ML techniques for predicting PPD risk has grown significantly, with a substantial increase in the number of publications over the past three years. In addition to maternal PPD risk, paternal PPD risk is gradually being considered. Current research generally focuses on PPD risk at six weeks or one year postpartum. The ML-based models for predicting PPD risk are mainly constructed using supervised learning algorithms, with multiple predictors involving demographic, psychosocial, obstetric, clinical, and biological variables. Most of the studies developed and validated ML-based models for predicting PPD risk; a few focused on improving ML-based models or proposing innovative ML algorithms. The best-performing ML-based model differed by outcome measurement time: SVM and FGB performed best for predicting PPD risk at six weeks and one year postpartum, respectively. Only approximately one-third of the ML-based models achieved AUROC scores of >0.8, indicating that the prediction performance of existing models was mediocre. Additionally, few studies translated ML-based models into clinical assessment tools. All the studies demonstrated a high risk of bias, and more than half showed a high applicability risk, owing to non-comprehensive development and validation processes or incomplete reporting of results. Overall, the current studies demonstrate the potential and feasibility of predicting PPD risk using ML techniques. However, there is a long way to go before the research results are translated into tools that can be applied in clinical practice.

4.2. ML techniques applied

According to the summary of ML-based models, the most commonly used ML algorithms for predicting PPD risk are SVM, RF, and LR. The SVM is a supervised ML algorithm that uses minimal input data to define the hyperplane that maximizes the margin between two classes (Jimenez-Serrano et al., 2015). It is suitable for handling multiple features and small-sample problems, and has the advantages of noise resistance, high learning efficiency, and good generalization (Zhang et al., 2020). RF is an ensemble classifier that combines the predictions of multiple decision trees and outputs the class voted for by a majority of the trees (Xiao et al., 2020). It is also suitable for high-dimensional feature problems and has the advantages of low sensitivity to missing data, no need for feature selection, a simple parameter adjustment process, fast calculation speed, and feature importance output (Zhang et al., 2020). LR is a type of generalized linear regression model that is suitable for low-dimensional feature problems and is characterized by fast calculation speed, simple form, and high interpretability (Jimenez-Serrano et al., 2015; S. Wang et al., 2019). However, there is currently no consensus on which type of ML-based model performs best. This may be because different ML techniques suit different research samples. For example, in a comparative study of SVM-based and RF-based ML models for predicting PPD risk, those constructed using SVM demonstrated higher sensitivity. One possible explanation for this result is that SVM-based ML models are better able to avoid overfitting in a small sample than those based on RF (Zhang et al., 2020).
Additionally, several studies tried to improve or innovate ML techniques for predicting PPD risk. Park et al. constructed a PPD risk prediction model using a pre-processing method called reweighing, which applied a different weight to each group-label combination according to the conditional probability of the label by race. This resulted in a greater reduction in algorithmic bias for PPD between White and Black individuals than simply excluding race from the prediction models (Park et al., 2021). Natarajan et al. adapted an innovative technique, FGB, that can handle class imbalance in a principled manner: the gradients are computed as the difference between the observed and predicted probabilities of each example, and a new regression tree is fitted to these examples at every iteration. This study showed that FGB performed well, with an AUROC, recall, and precision of 0.952, 0.84, and 0.92, respectively, outperforming the baseline classifiers (Natarajan et al., 2017).
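The reweighing idea described above can be illustrated with a minimal sketch. It assumes the commonly used formulation in which each (group, label) combination is weighted by its expected frequency under independence divided by its observed frequency; the function name, the group codes, and the toy data are hypothetical, and the details may differ from what Park et al. implemented.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Return one weight per (group, label) combination.

    Assumed formulation: weight = P(group) * P(label) / P(group, label),
    i.e. the frequency expected if group and label were independent,
    divided by the frequency actually observed.
    """
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    pair_counts = Counter(zip(groups, labels))
    return {
        (g, y): (group_counts[g] * label_counts[y]) / (n * pair_counts[(g, y)])
        for (g, y) in pair_counts
    }


# Hypothetical toy data: group membership and PPD label (1 = PPD).
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
labels = [1, 0, 0, 0, 1, 1, 1, 0]

weights = reweighing_weights(groups, labels)
sample_weights = [weights[(g, y)] for g, y in zip(groups, labels)]


def weighted_positive_rate(group):
    """Weighted share of positive labels within one group."""
    total = sum(w for w, g in zip(sample_weights, groups) if g == group)
    positive = sum(
        w for w, g, y in zip(sample_weights, groups, labels)
        if g == group and y == 1
    )
    return positive / total


print(weights)
print(weighted_positive_rate("A"), weighted_positive_rate("B"))
```

After weighting, the positive rate is the same in both groups of the toy data, which is the property the pre-processing step aims for. The resulting per-sample weights would then be passed to whichever classifier is trained.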

4.3. Important PPD predictors

Demographic features were the most commonly included predictors, with maternal age being the most frequently included. The risk associated with age is concentrated at both extremes: women younger than 19 or older than 35 years are at greater risk for PPD (Amit et al., 2021; Shin et al., 2020).
Psychosocial predictors appear to be the most significant predictors for PPD. Combining EHRs or background data with the scores of self-reported psychometric scales could increase the AUROC or sensitivity scores during the screening process (Amit et al., 2021; Andersson et al., 2021). The models' performance decreased when the women were stratified based on their history of depression or other psychiatric disorders before pregnancy (Andersson et al., 2021; Hochman et al., 2021). Although Payne et al. demonstrated that the epigenetic PPD biomarker contained in their model was highly accurate among women with and without a psychiatric history, they emphasized that antenatal depression status was also crucial in generating accurate predictions of PPD (Payne et al., 2020).
Obstetric predictors related to PPD risk include the gender of newborns, delivery mode, parity, and gestational age at delivery. The association between the gender of newborns and the risk of PPD varies across countries; for example, a preference for male infants has been reported in Japan (Mori et al., 2018). As reported in previous studies, the role of parity and delivery mode in predicting PPD risk remains controversial (Baba et al., 2021; Deng et al., 2021). Women who give birth before the gestational period is complete are at a higher risk of PPD (Amit et al., 2021).
Clinical predictors associated with PPD comprise physical signs, medication use, and disease diagnoses, such as pre-pregnancy BMI, thyroid dysfunction, and the use of antidepressants. A previous study showed that disease diagnoses (vs. demographic or medication data) demonstrated the best performance (AUROC = 0.72) in predicting PPD risk, and that combining disease diagnoses with medication data improved the prediction of PPD risk (AUROC = 0.76) compared to using disease diagnoses alone (Shin et al., 2020).
Biological predictors include laboratory indicators and epigenetic biomarkers obtained from patients' blood samples, and the latter seem to predict PPD risk better when included in the model. For instance, Hochman et al. considered albumin, ferritin, hematocrit, hemoglobin, and creatinine levels during pregnancy when predicting the risk of PPD, and obtained an AUROC, sensitivity, and specificity of 0.712, 0.349, and 0.905, respectively, at the 90th percentile risk threshold (Hochman et al., 2021). In contrast, Payne et al. showed that the evaluation of DNA methylation in HP1BP3 and TTC9B resulted in an accurate prediction of future EPDS scores among patients, with an AUROC of 0.84 (Payne et al., 2020). Tortajada et al. selected combination genotypes (5-HTT-GC) as predictors of PPD risk and obtained an AUROC, sensitivity, and specificity of 0.84, 0.78, and 0.85, respectively (Tortajada et al., 2009).
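To make the phrase "sensitivity and specificity at the 90th percentile risk threshold" concrete, the following minimal sketch flags everyone whose predicted risk lies strictly above the 90th percentile and then computes both metrics. The nearest-rank percentile definition, the function names, and the toy scores are assumptions for illustration only and do not reproduce any included study's data or exact procedure.

```python
import math


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sensitivity_specificity(scores, labels, threshold):
    """Treat scores strictly above the threshold as 'high risk'."""
    tp = fn = fp = tn = 0
    for score, label in zip(scores, labels):
        flagged = score > threshold
        if label == 1 and flagged:
            tp += 1
        elif label == 1:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predicted risks for 20 women (0.05, 0.10, ..., 1.00),
# four of whom (positions 8, 14, 17, 20) later developed PPD.
scores = [i / 20 for i in range(1, 21)]
positives = {8, 14, 17, 20}
labels = [1 if i in positives else 0 for i in range(1, 21)]

threshold = nearest_rank_percentile(scores, 90)
sensitivity, specificity = sensitivity_specificity(scores, labels, threshold)
print(threshold, sensitivity, specificity)
```

A high percentile threshold flags only a small share of women, so the pattern of low sensitivity combined with high specificity seen in the toy output is what one would generally expect from this kind of cut-off.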
Additionally, unstructured data features can also be used to predict PPD risk, but the prediction performance varies greatly, which may be related to the population whose PPD risk is being predicted. For example, Gabrieli et al. used acoustic features of infants' cries (fundamental frequency, first four formants, and intensity) and demographic information (infants' gender, mothers' age) to predict the risk of PPD, and obtained an AUROC, accuracy, recall, and precision of 0.969, 0.895, 0.88, and 0.904, respectively (Gabrieli et al., 2020). Fatima et al. employed 20 linguistic features of women's texts posted on Reddit to predict the risk of PPD, obtained an accuracy, recall, and precision of 0.869, 0.868, and 0.87, respectively, and concluded that the linguistic feature “family” played an important predictive role (Fatima et al., 2019). Shatte et al. evaluated the risk of PPD using postpartum changes in fathers' depression symptoms, including behavior, discussion topics, linguistic styles, and emotion. The results showed that the high-risk fathers exhibited lower behavioral engagement with the platform and engaged in fewer discussions categorized as Humor, Lifestyle, and Image related, whereas the low-risk fathers showed more anger and used more swear words than the high-risk group across both the prepartum and postpartum periods. However, the performance metrics were not excellent, with an accuracy, recall, and precision of 0.66, 0.68, and 0.67, respectively. This might be due to the poor applicability of the selected predictors, which were adapted from those used for mothers, or to the omission of potentially high-risk fathers who only passively observe discussions without actively contributing (Shatte et al., 2020).

4.4. Outcomes measurement time

Except for the study by Fatima et al., which did not define the PPD measurement time (Fatima et al., 2019), the included studies mainly measured PPD in six periods, and the most common measurement times were 6 weeks postpartum and 1 year postpartum. This is an obvious source of heterogeneity among the included studies, leading to differences in both the methods used and the results obtained across studies with different measurement times. Regarding methods, we observed that most studies that measured PPD risk at 6 weeks postpartum built their models on prospective clinical study data (Andersson et al., 2021; Fang, 2019; Payne et al., 2020; Xiao et al., 2020; Zhang et al., 2020), whereas most studies that measured PPD risk during the first year postpartum built theirs on retrospective data or databases (Amit et al., 2021; Hochman et al., 2021; Shin et al., 2020; S. Wang et al., 2019; Zhang et al., 2021). This might be explained by the fact that mothers are usually required to return to hospital at 6 weeks postpartum for a health check for themselves and their infants, which gives researchers the chance to assess maternal depression symptoms directly. In addition, prospective monitoring of PPD throughout the first year postpartum is hard for researchers to implement owing to limits on manpower, energy, and money. Regarding results, we found that the commonly used ML techniques for PPD risk at 6 weeks postpartum and during the first year postpartum were SVM (Payne et al., 2020; Shatte et al., 2020; Zhang et al., 2020) and gradient-boosting-based algorithms (Amit et al., 2021; Hochman et al., 2021; Natarajan et al., 2017), respectively. This result was likely related to the data complexity of prospective studies and retrospective databases. As discussed above, SVM is suitable for multiple features and small-sample problems, while gradient-boosting-based algorithms are suitable for class imbalance and larger samples.

4.5. Deficiencies in model development and validation

Multiple studies on predicting PPD risk have been published in recent years. However, the total number of published studies remains small and their overall quality is inadequate. Additionally, the performance of the models proposed in these studies is uneven, and as a result, they cannot be directly applied in clinical practice. This indicates that ML has gained tremendous interest in recent years, but it remains a relatively novel field. One possible reason is that PROBAST was published in 2019 (Moons et al., 2019) and standardized guidelines on the conduct of ML studies were proposed in 2020 (Stevens et al., 2020), so studies conducted before the publication of these guidelines lacked a methodological reference. Therefore, the studies had many deficiencies in the development and validation processes, involving the participants, predictors, outcome, and analysis domains.
Regarding participants, some studies obtained data from EHRs or medical system database (Amit et al., 2021; Hochman et al., 2021; Park et al., 2021; Shin et al., 2020; S. Wang et al., 2019; Zhang et al., 2021), and as a result, the predictors and outcomes may be missing or defined in different ways, thereby resulting in decreased data integrity and availability. Additionally, the criteria for eligibility in some studies excluded subgroups with the potential risk of PPD, such as women with a history of mental disorders (Fang, 2019), those with thyroid dysfunction (Xiao et al., 2020), and those suffering from chronic diseases. However, previous studies have shown that a history of mental disorders is a strong predictor for the risk of PPD (Shakeel et al., 2018), thyroid dysfunction poses an increased risk for PPD (RR 1.49, 95 % CI 1.11–2.00) (George et al., 2021; Johar et al., 2020), and chronic diseases, such as diabetes, are also associated with the increased risk for PPD (OR 2.23, 95 % CI 1.23–4.05) (Ruohomaki et al., 2018).
Regarding predictors, some studies developed ML-based models that mainly focused on specific aspects, such as infants' cry vocalizations (Gabrieli et al., 2020) and the linguistic features of social-media posts (Fatima et al., 2019), which may result in a certain rate of missed screening owing to non-comprehensive predictors. Additionally, several studies did not report the definition, classification, measurement, and effectiveness of the predictors included in their final models, resulting in a high risk of bias. Furthermore, the number of predictors in most studies exceeded 20, which makes assessment and recording difficult in a busy clinical setting and significantly reduces the applicability of the prediction models. Hochman et al. evaluated and compared the performance of a main model using 156 predictors and a Q-based model using 9 predictors, and they established that the latter was simpler and easier to implement, with only a slight reduction in accuracy (Hochman et al., 2021). Model performance metrics, especially accuracy and AUC scores, remain stable even when the number of variables used in the models is reduced from 100 % to 50 %, and even when it is reduced to 25 % of all the variables available (5–10 variables) (Andersson et al., 2021). This may be due to redundancy among a large number of predictors; for example, depression and anxiety during pregnancy are highly correlated with some background and medical history variables.
Regarding outcomes, the risk of bias in a few studies mainly originated from the use of suboptimal methods for classifying PPD. Generally, diagnoses based on ICD-9/10 codes and the results of scale screening or a psychiatrist's clinical interview are standard or accepted classification methods for PPD symptoms or risk. However, two studies assessed PPD from participants' linguistic features, because their modeling data were social media text. This raised two concerns regarding the risk of bias: first, the outcomes may not be classified reasonably enough; second, the outcome measurement might itself draw on the predictors, which would reduce confidence in the predictive performance of the PPD risk model.
The analysis domain had the most problems associated with the risk of bias across all the studies. First, none of the included studies justified the appropriateness of their sample size, given the number of candidate predictors used in model development. The events per variable (EPV) in several studies were low, which probably resulted in a risk of overfitting. Although PROBAST recognizes an EPV of at least 20 as not posing a risk of bias, prediction models developed using ML techniques often require an EPV of over 200 (Moons et al., 2019). Therefore, in our future studies, we shall use dimensionality reduction techniques before modeling to reduce the number of candidate predictors if a small sample is unavoidable. Next, internal validation in several studies used random split validation only. However, when the sample size is small, random splitting uses the data inefficiently, resulting in a risk of bias, whereas K-fold cross-validation and bootstrap resampling are better alternatives (Moons et al., 2012b). Regrettably, only approximately 20 % of the models underwent complete external validation. This is probably because most researchers pay more attention to model development than to model validation, so models for predicting the risk of PPD have emerged continuously in recent years, but few of them have been effectively validated. Finally, most of the studies reported discrimination metrics only and lacked calibration metrics, so it is unknown whether the event probabilities predicted by most ML-based models agree with the observed event probabilities. An excellent prediction model should have high levels of both discrimination and calibration, and the former should be the basis of the latter. If the calibration is not satisfactory, the model should be recalibrated to enhance performance (Moons et al., 2012a).
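Two of the checks discussed above, events per variable and calibration, are simple to compute, as the following minimal sketch shows. The function names and the toy numbers are hypothetical, and the calibration summaries used here (the Brier score and the gap between mean predicted risk and the observed event rate) are only two of several possible calibration measures, chosen for brevity.

```python
def events_per_variable(n_events, n_candidate_predictors):
    """EPV: number of outcome events divided by number of candidate predictors."""
    return n_events / n_candidate_predictors


def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(predicted_probs, outcomes)) / len(outcomes)


def mean_calibration_gap(predicted_probs, outcomes):
    """Mean predicted risk minus observed event rate; 0 means no overall over- or under-prediction."""
    return sum(predicted_probs) / len(predicted_probs) - sum(outcomes) / len(outcomes)


# Hypothetical development sample: 120 PPD cases and 30 candidate predictors.
epv = events_per_variable(n_events=120, n_candidate_predictors=30)

# Hypothetical predictions from a model that systematically over-predicts risk.
predicted_probs = [0.30, 0.40, 0.60, 0.70, 0.80, 0.90]
outcomes = [0, 0, 0, 1, 0, 1]

print(epv)
print(brier_score(predicted_probs, outcomes))
print(mean_calibration_gap(predicted_probs, outcomes))
```

In the toy data the EPV falls well below the thresholds mentioned in the text, and the positive calibration gap shows predicted risks running higher than the observed event rate, which is the kind of result that would prompt recalibration.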

4.6. Implications

Throughout the literature, there is a growing trend of using EHRs to develop ML-based models for predicting the risk of PPD, thereby resulting in higher requirements for data processing technology. In this systematic review, a total of four studies developed ML-based models for predicting the risk of PPD using EHRs (Amit et al., 2021; Hochman et al., 2021; S. Wang et al., 2019; Zhang et al., 2021), including the latest three studies. This tendency exists because EHRs comprise digital health information documented by tracking repeated measurements of patients' conditions over time, and they cover a large population (Aliabadi et al., 2020). Accurate and longitudinal EHRs have the potential to be used in the identification and analysis of new risk factors or outcomes at the individual and population levels. Additionally, ML is a vigorous data mining technique that is suitable for deriving further insight from large sets of multivariate medical data (Nwanosike et al., 2022). Therefore, EHR-based ML presents a potential to develop more comprehensive and usable models for risk stratification. However, EHRs, as sources of temporal data, usually have complex structures and unbalanced distribution of clinical events, thereby posing multiple technical challenges, such as data irregularity, heterogeneity, and sparsity. Given the limitations of general ML algorithms in addressing these challenges, advanced ML approaches, such as deep learning-based methods must be proposed in future studies (Xie et al., 2022).
Owing to advancements in ML technology, a series of ensemble and reinforcement learning algorithms show high performance, but the interpretability of such models decreases; this is called the black-box effect. However, when using clinical data to make predictions and decisions, it is necessary to know not only the model output but also the risk factors, their weights, their correct interpretation, and the specific guidance they offer for clinical practice. Therefore, researchers should be mindful not to pursue high accuracy at the cost of clinical significance. However, in this systematic review, few models were translated into clinical tools and tested in clinical environments. While some studies have used feature importance to identify which features have a greater impact on ML models, the specific association between the features and the prediction results could not be explained (S. Wang et al., 2019). To improve the interpretability of ML models, recent studies have replaced feature importance with mean SHAP values to evaluate the contribution of each feature to the model output, especially for ensemble tree models, such as XGBoost (Hochman et al., 2021). In our opinion, rationally interpreting ML models and translating them into clinical tools remain the focus of researchers and clinical workers, and they are also the key to promoting the continuous development and improvement of ML models in clinical practice.
ML-based models have significant advantages over conventional risk-prediction tools because they can learn from a wide range of data types, including structured clinical, laboratory, and genetic information, along with unstructured textual, vocalization, and even imaging data. A previous study showed that ML models that used a combination of unstructured text data from clinical notes and structured clinical data predicted better than models that did not use textual data (Ross et al., 2019). Our systematic review found that ML-based models can be effective when they use either structured clinical data or unstructured data obtained from social media posts as predictors of PPD risk. However, few studies have integrated both of these data types. Following the example of Ross et al., future studies should consider combining multiple data types to train ML-based models, thereby increasing predictive power.

5. Limitations

This study has some limitations. This systematic review aimed to synthesize the application of ML techniques in predicting PPD risk. However, owing to subjective factors such as the inclusion and exclusion criteria and researcher judgment, the number of included studies was smaller than the full range of actual ML applications, resulting in selection bias. To understand domestic and foreign research contexts, we included studies written in English and Chinese (n = 2), which slightly reduced the replicability of the search and review processes. One Chinese article could not be found in the Web of Science. Moreover, we could not quantitatively analyze the predictive performance of the included ML-based models for predicting PPD risk. This is related to the significantly inconsistent performance metrics reported across the studies and the absence of meta-analysis methods applicable to prediction models constructed using ML techniques.

6. Conclusions

We reviewed 17 studies on the development and validation of ML-based models for predicting PPD risk. We comprehensively summarized the basic characteristics of all studies, the descriptions of the model development and validation processes, and the key reported measures of model performance, and assessed the risk of bias and applicability of the prediction models. The key finding was that few studies have achieved acceptable quality standards and that ML techniques are yet to be implemented successfully in predicting PPD risk in clinical environments. Therefore, future studies must develop ML models based on standard methodologies, attach importance to external validation, report results in detail, and translate models into accessible open-source tools for implementing clinical decision support. We believe that these improvements will help ML-based prediction models of PPD risk to be applied successfully in clinical practice in the near future.

CRediT authorship contribution statement

MZ and XD proposed the research question and coordinated task assignment. HZ and CY independently screened titles, abstracts, and full texts for eligibility. MZ extracted the data from the included papers and JJ checked the extracted data. MZ, HZ, and CY assessed the risk of bias and applicability of the included studies. Any discrepancies were resolved by XD and JJ. MZ, HZ, and CY prepared the first version of the manuscript, and XD and JJ revised the first draft. All authors revised and approved the final version of the manuscript.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

Acknowledgement

We would like to thank Tongji University Library for providing access to the databases and Editage (www.editage.cn) for English language editing.

Funding

This work was supported by the Science and Technology Commission of Shanghai Municipality [grant number 21Y11905900].

1 These authors have contributed equally to this work and share first authorship.