- PDF 李玲玲
- Split View 分屏视图
-
Views
-
Cite 引用
Cite
Zhi Zhao, John Zobolas, Manuela Zucknick, Tero Aittokallio, Tutorial on survival modeling with applications to omics data, Bioinformatics, Volume 40, Issue 3, March 2024, btae132, https://doi.org/10.1093/bioinformatics/btae132
- Share Icon Share 共享
Abstract 抽象
Identification of genomic, molecular and clinical markers prognostic of patient survival is important for developing personalized disease prevention, diagnostic and treatment approaches. Modern omics technologies have made it possible to investigate the prognostic impact of markers at multiple molecular levels, including genomics, epigenomics, transcriptomics, proteomics and metabolomics, and how these potential risk factors complement clinical characterization of patient outcomes for survival prognosis. However, the massive sizes of the omics datasets, along with their correlation structures, pose challenges for studying relationships between the molecular information and patients’ survival outcomes.
确定患者生存预后的基因组、分子和临床标志物对于开发个性化的疾病预防、诊断和治疗方法非常重要。现代组学技术使得在多个分子水平上研究标志物的预后影响成为可能,包括基因组学、表观基因组学、转录组学、蛋白质组学和代谢组学,以及这些潜在风险因素如何补充患者生存预后预后的临床特征。然而,组学数据集的庞大规模及其相关结构对研究分子信息与患者生存结果之间的关系构成了挑战。
We present a general workflow for survival analysis that is applicable to high-dimensional omics data as inputs when identifying survival-associated features and validating survival models. In particular, we focus on the commonly used Cox-type penalized regressions and hierarchical Bayesian models for feature selection in survival analysis, which are especially useful for high-dimensional data, but the framework is applicable more generally.
我们提出了一种生存分析的一般工作流程,适用于在识别生存相关特征和验证生存模型时作为输入的高维组学数据。特别是,我们关注了生存分析中常用的 Cox 型惩罚回归和分层贝叶斯模型进行特征选择,它们对高维数据特别有用,但该框架更普遍地适用。
可用性和实施
A step-by-step R tutorial using The Cancer Genome Atlas survival and omics data for the execution and evaluation of survival models has been made available at https://ocbe-uio.github.io/survomics.
https://ocbe-uio.github.io/survomics 上提供了使用癌症基因组图谱生存和组学数据执行和评估生存模型的分步 R 教程。
副主编: Jonathan Wren
1 Introduction 1 引言
Personalized medicine improves patient diagnosis and treatment by making use of patient-specific genomic and molecular markers that are indicative of disease development. Time to an event of interest (e.g. time to death or disease progression) is a widely used end point and patient outcome for many diseases, and therefore it has become popular to identify genomic, molecular and clinical markers for survival or progression prediction of patients suffering from complex diseases such as cancer. In this tutorial, we will only consider so-called right-censored survival data, where a patient has been followed for a certain time period and the event of interest is either observed in this time period or might occur at a later (as yet unobserved) time point. Right-censored survival data include both a time and the status of each patient at that time as joint outcomes, where time is a continuous variable and status is a binary variable indicating whether the event of interest has been observed up to the given time or not; in the latter case, we refer to this observation as censored at the observed time. See Table 1 for an example illustration, and more details of the survival data in Section 2. When using the status label as an outcome in an ordinary logistic regression, the regression coefficients will become increasingly uncertain and less reliable with increasing follow-up time (Green and Symons 1983). When using the observed time (or its transformation, e.g. logarithm of time) as an outcome in an ordinary linear regression, the presence of censored observations (i.e. patients still alive by the end of follow-up period) causes considerable difficulties for assessing the accuracy of predictions (Henderson 1995). For example, Table 1 shows an example where the proportion of patients surviving past 10 years is based on the observed data, but the (unobserved) factual proportion surviving past 10 years is .
个性化医疗通过利用指示疾病发展的患者特异性基因组和分子标志物来改善患者的诊断和治疗。发生感兴趣事件的时间(例如死亡或疾病进展的时间)是许多疾病广泛使用的终点和患者结果,因此识别基因组、分子和临床标志物以预测患有癌症等复杂疾病的患者的生存或进展已变得很流行。在本教程中,我们将只考虑所谓的右删失生存数据,其中患者已经被跟踪了一段时间,并且感兴趣的事件要么在该时间段内观察到,要么可能发生在较晚的(尚未观察到的)时间点。右删失生存数据包括时间和每个患者在该时间的状态作为联合结果,其中时间是一个连续变量,状态是一个二进制变量,表明在给定时间之前是否观察到感兴趣的事件;在后一种情况下,我们将此观测称为在观测时间删失。有关示例说明,请参见表 1,并在第 2 节中有关生存数据的更多详细信息。当在普通 logistic 回归中使用状态标签作为结果时,随着随访时间的增加,回归系数将变得越来越不确定和不可靠 (Green 和 Symons 1983)。当使用观测时间(或其变换,例如 时间的对数)作为普通线性回归的结果时,删失观察的存在(即 患者在随访期结束时仍然活着)对评估预测的准确性造成相当大的困难(Henderson 1995)。 例如,表 1 显示了一个例子,根据观察到的数据,存活超过 10 年的患者比例为 但(未观察到的)存活超过 10 年的事实比例为
Patient ID 患者 ID . | Observed time (years) 观察时间(年) . | Status at observed time 观察时的状态 . | Factual time to death, possibly unobserved (years) 实际死亡时间,可能未观察到(年) . |
---|---|---|---|
1001 | 11 | Censored 审查 | 20 |
1002 | 4 | Dead 死 | 4 |
1003 | 5 | Censored 审查 | 12 |
1004 | 9 | Dead 死 | 9 |
1005 | 1 | Censored 审查 | 11 |
Patient ID . | Observed time (years) . | Status at observed time . | Factual time to death, possibly unobserved (years) . |
---|---|---|---|
1001 | 11 | Censored | 20 |
1002 | 4 | Dead | 4 |
1003 | 5 | Censored | 12 |
1004 | 9 | Dead | 9 |
1005 | 1 | Censored | 11 |
If a patient was alive at the last observed time point, the measurement is censored at that time point.
如果患者在最后一个观察到的时间点还活着,则在该时间点删失测量值。
Patient ID 患者 ID . | Observed time (years) 观察时间(年) . | Status at observed time 观察时的状态 . | Factual time to death, possibly unobserved (years) 实际死亡时间,可能未观察到(年) . |
---|---|---|---|
1001 | 11 | Censored 审查 | 20 |
1002 | 4 | Dead 死 | 4 |
1003 | 5 | Censored 审查 | 12 |
1004 | 9 | Dead 死 | 9 |
1005 | 1 | Censored 审查 | 11 |
Patient ID . | Observed time (years) . | Status at observed time . | Factual time to death, possibly unobserved (years) . |
---|---|---|---|
1001 | 11 | Censored | 20 |
1002 | 4 | Dead | 4 |
1003 | 5 | Censored | 12 |
1004 | 9 | Dead | 9 |
1005 | 1 | Censored | 11 |
If a patient was alive at the last observed time point, the measurement is censored at that time point.
如果患者在最后一个观察到的时间点还活着,则在该时间点删失测量值。
The availability of multiple types of genomic and molecular data poses great opportunities but also further challenges for building effective statistical models to identify biomarkers that are prognostic of patient survival. For example, omics profiles, such as those from mRNA expression, DNA copy number, single-point and other genetic mutations, may be available from the same patient, and these high-dimensional data come with intra- and inter-dataset correlations, heterogeneous measurement scales, missing values, technical variability, and other background noise (Hasin et al. 2017). Rahnenführer et al. (2023) provided a general guideline for high-dimensional omics and electronic health records data analysis, and discussed statistical challenges and opportunities for survival modeling. Bøvelstad et al. (2007) and Bøvelstad et al. (2009) compared various machine learning methods and showed improved survival prediction performance by coefficient shrinkage methods that combine several data sources, in their case clinical and gene expression data. Another recent study by Bommert et al. (2022) performed a benchmark of filter methods for feature selection in high-dimensional gene expression data for survival prediction, and recommended using the simple variance filter before fitting a -regularized Cox regression model (Simon et al. 2011) for accurate survival prediction and stable feature selection. Vinga (2021) reviewed structured penalized regressions for analyzing high-dimensional omics data for survival prediction or evaluation. Multiple studies have shown that it is possible to further improve the prediction accuracy and feature selection by considering more complex structures, such as biological pathways, or by identifying significant features among functional relationships between the omics features (Chekouo et al. 2015, Kundu et al. 2018, Wang and Liu 2020, Madjar et al. 2021).
多种类型的基因组和分子数据的可用性带来了巨大的机遇,但也为构建有效的统计模型以识别预测患者生存的生物标志物带来了进一步的挑战。例如,组学图谱,例如来自 mRNA 表达、DNA 拷贝数、单点和其他基因突变的组学图谱,可能来自同一患者,并且这些高维数据带有数据集内和数据集间的相关性、异质性测量尺度、缺失值、技术可变性和其他背景噪声(Hasin 等人,2017 年)。Rahnenführer 等人(2023 年)为高维组学和电子健康记录数据分析提供了通用指南,并讨论了生存建模的统计挑战和机遇。Bøvelstad 等人 (2007) 和 Bøvelstad 等人 (2009) 比较了各种机器学习方法,并表明结合多个数据源的系数收缩方法提高了生存预测性能,在他们的案例中是临床和基因表达数据。Bommert 等人 (2022) 最近的另一项研究对高维基因表达数据中的特征选择进行了过滤方法的基准测试,用于生存预测,并建议在拟合 正则化 Cox 回归模型之前使用简单的方差过滤器 (Simon et al. 2011) 进行准确的生存预测和稳定的特征选择。Vinga (2021) 回顾了用于分析高维组学数据以进行生存预测或评估的结构化惩罚回归。 多项研究表明,通过考虑更复杂的结构(例如生物途径),或通过识别组学特征之间的功能关系中的显着特征,可以进一步提高预测准确性和特征选择(Chekouo 等人,2015 年,Kundu 等人,2018 年,Wang 和 Liu 2020 年,Madjar 等人,2021 年)。
In this tutorial, we describe a general workflow for survival analysis with omics data, as well as review the commonly used statistical methods for feature selection and survival prediction; importantly, we provide a step-by-step R tutorial using publicly available omics data from The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov). In this example dataset, the overall survival time, demographic and gene expression data from primary invasive breast cancer patients in TCGA (TCGA-BRCA) were retrieved from the Genomic Data Commons Data Portal data release 32.0–36.0. Compared to the previous reviews and benchmarks of survival models in bioinformatic applications, this tutorial provides a complete workflow ranging from data preparation to final calibrated models, with a particular focus on building survival models using high-dimensional omics data, and as such covers both the commonly used penalized regressions and Bayesian models for survival analysis with high-dimensional and generally noisy datasets. For this tutorial, we assume that readers have the knowledge of basic statistical methods. Terms beyond basic statistics are explained in corresponding text.
在本教程中,我们描述了使用组学数据进行生存分析的一般工作流程,并回顾了用于特征选择和生存预测的常用统计方法;重要的是,我们使用癌症基因组图谱 (TCGA) 项目 (http://cancergenome.nih.gov) 中公开可用的组学数据提供了分步 R 教程。在此示例数据集中,从 Genomic Data Commons Data Portal 数据版本 32.0-36.0 中检索了 TCGA (TCGA-BRCA) 原发性浸润性乳腺癌患者的总生存时间、人口统计学和基因表达数据。与之前生物信息学应用中生存模型的综述和基准测试相比,本教程提供了从数据准备到最终校准模型的完整工作流程,特别侧重于使用高维组学数据构建生存模型,因此涵盖了常用的惩罚回归和贝叶斯模型,用于高维和通常噪声数据集的生存分析。在本教程中,我们假设读者具有基本统计方法的知识。基本统计以外的术语在相应的文本中进行了解释。
2 Data categories 2 数据类别
We use omics data and overall survival outcomes from cancer patients as an example in this tutorial, but the methods are applicable also to other diseases with similar data types in personalized medicine applications. The ultimate goal of personalized medicine is to identify patient-specific risk factors to guide disease prevention, diagnostic and treatment strategies. The identification of potential risk factors for cancer patients often considers clinical, demographic, genomic and molecular information, and their associations with the patients’ time-to-event data (i.e. survival).
在本教程中,我们使用癌症患者的组学数据和总生存期结果作为示例,但这些方法也适用于个性化医疗应用中具有类似数据类型的其他疾病。个性化医疗的最终目标是确定患者特异性的风险因素,以指导疾病预防、诊断和治疗策略。确定癌症患者的潜在风险因素通常考虑临床、人口统计学、基因组和分子信息,以及它们与患者事件发生时间数据(即生存率)的关联。
2.1 Time-to-event data 2.1 事件发生时间数据
IME 到事件或生存数据包含感兴趣的事件(例如,死亡是本节中假设的事件),以及从研究期开始到事件发生、研究结束或患者失访的时间(即右删失;右、左和区间删失的讨论可以在 Leung 等人 1997 中找到)。图 1c 显示了患者自癌症诊断以来的生存时间或右删失时间。由于删失,可能无法观察到患者的确切生存时间。因此,患者有两个结果指数:删失指标(也称为状态)和观察时间
这是患者在时间 t 死亡的瞬时概率,条件是患者存活到该时间点。
2.2 Clinical and demographic data
2.2 临床和人口统计数据
There are multiple sources of patient-level information that can be explored to identify risk factors for cancer patients. One can start with routinely collected and commonly used patient data, such as clinical and demographic variables. For example, an older male patient with a low body mass index (BMI) has a relatively high risk of gastric cancer (Nam et al. 2022). Table 2 illustrates selected clinical and demographic variables often available for cancer patients. Clinical and demographic variables are considered important sources of information for predicting survival and are often used to build reference models for omics-based prognostic models (Herrmann et al. 2021).
可以探索多种患者层面的信息来源,以确定癌症患者的风险因素。可以从常规收集和常用的患者数据开始,例如临床和人口统计变量。例如,体重指数 (BMI) 低的老年男性患者患胃癌的风险相对较高(Nam 等人,2022 年)。表 2 说明了癌症患者通常可用的选定临床和人口统计变量。临床和人口统计变量被认为是预测生存的重要信息来源,通常用于为基于组学的预后模型构建参考模型(Herrmann 等人,2021 年)。
Variable 变量 . | Data type 数据类型 . |
---|---|
Sex 性 | Binary 二元的 |
Body mass index (BMI, 体重指数 (BMI, | Continuous 连续的 |
Ethnicity 种族 | Nominal 名义 |
Age at first diagnosis in years 首次诊断年龄 | Integer 整数 |
Pathological stage 病理阶段 | Ordinal 序数 |
Therapy type (e.g. chemo-, hormone, immuno-therapy) 治疗类型(例如 化疗、激素、免疫疗法) | Nominal 名义 |
Variable . | Data type . |
---|---|
Sex | Binary |
Body mass index (BMI, | Continuous |
Ethnicity | Nominal |
Age at first diagnosis in years | Integer |
Pathological stage | Ordinal |
Therapy type (e.g. chemo-, hormone, immuno-therapy) | Nominal |
Variable 变量 . | Data type 数据类型 . |
---|---|
Sex 性 | Binary 二元的 |
Body mass index (BMI, 体重指数 (BMI, | Continuous 连续的 |
Ethnicity 种族 | Nominal 名义 |
Age at first diagnosis in years 首次诊断年龄 | Integer 整数 |
Pathological stage 病理阶段 | Ordinal 序数 |
Therapy type (e.g. chemo-, hormone, immuno-therapy) 治疗类型(例如 化疗、激素、免疫疗法) | Nominal 名义 |
Variable . | Data type . |
---|---|
Sex | Binary |
Body mass index (BMI, | Continuous |
Ethnicity | Nominal |
Age at first diagnosis in years | Integer |
Pathological stage | Ordinal |
Therapy type (e.g. chemo-, hormone, immuno-therapy) | Nominal |
2.3 Omics data 2.3 组学数据
Thanks to the rapid development of modern molecular biotechnology, large amounts of human genomic and molecular data have become available from many patient profiling projects. These projects often collect multiple levels of molecular information such as genomics data for DNA variation, transcriptomics data for mRNA expression, proteomics data for protein abundance and metabolomics data for metabolite processes, as illustrated in Fig. 1a. Among the multiple omics levels, metabolomics is the closest to observable phenotypes, such as tumor growth and proliferation (Cairns et al. 2011). To deeply understand the molecular biology of tumor development, multiple levels of omics data may deliver novel insights into the circuits of molecular interactions that underlie the disease initiation and progression (Tarazona et al. 2021).
由于现代分子生物技术的快速发展,许多患者分析项目已经提供了大量的人类基因组和分子数据。这些项目通常收集多个层次的分子信息,例如 DNA 变异的基因组学数据、mRNA 表达的转录组学数据、蛋白质丰度的蛋白质组学数据和代谢物过程的代谢组学数据,如图 1a 所示 。在多个组学水平中,代谢组学是最接近可观察的表型的,例如肿瘤生长和增殖(Cairns 等人,2011 年)。为了深入了解肿瘤发展的分子生物学,多层次的组学数据可能会为构成疾病发生和进展基础的分子相互作用回路提供新的见解(Tarazona 等人,2021 年)。
DNA-level omics data often include single nucleotide polymorphisms (SNP), DNA methylation, and somatic copy number variation, see illustration of these data types in Table 3. Each SNP feature can be coded as
DNA 水平组学数据通常包括单核苷酸多态性 (SNP)、DNA 甲基化和体细胞拷贝数变异,参见表 3 中这些数据类型的插图。每个 SNP 特征都可以编码为
Patient ID 患者 ID . | SNP SNP (SNP) . | Methylation ( 甲基化 ( | Copy number variation (number of copies) 拷贝数变化(拷贝数) . | Gene expression (reads) 基因表达 (reads) . | miRNA (reads) miRNA(读取) . | Protein (intensity/concentration) . | Metabolite (intensity/concentration) . |
---|---|---|---|---|---|---|---|
1001 | 1 | 0.2 | 0 | 5 | 5 | 0.07 | 0.07 |
1002 | 0 | 0.11 | 5 | 2 | 2 | 0.1 | 0.1 |
1003 | 1 | 0.95 | 10 | 0 | 0 | 9.6 | 9.6 |
1004 | 2 | 0.5 | 4 | 30 | 30 | 2.8 | 2.8 |
Technology 科技 | SNP array SNP 阵列 | Infinium BeadChip | SNP array (Gistic2) SNP 阵列 (Gistic2) | RNA-Seq RNA-测序 | RNA-Seq RNA-测序 | Mass spectrometry | Mass spectrometry |
Distribution assumptions 分配假设 | Ordinal 序数 | Beta mixture β 混合物 | Negative binomial 负二项式 | Negative binomial 负二项式 | Negative binomial 负二项式 | log-Gaussian | log-Gaussian |
log2 ratio: Gaussian 对数2 比值:高斯 | log2 scale: Gaussian 对数2 尺度:高斯 | log2 scale: Gaussian 对数2 尺度:高斯 | log2 scale: Gaussian | log2 scale: Gaussian |
Patient ID . | SNP . | Methylation ( | Copy number variation (number of copies) . | Gene expression (reads) . | miRNA (reads) . | Protein (intensity/concentration) . | Metabolite (intensity/concentration) . |
---|---|---|---|---|---|---|---|
1001 | 1 | 0.2 | 0 | 5 | 5 | 0.07 | 0.07 |
1002 | 0 | 0.11 | 5 | 2 | 2 | 0.1 | 0.1 |
1003 | 1 | 0.95 | 10 | 0 | 0 | 9.6 | 9.6 |
1004 | 2 | 0.5 | 4 | 30 | 30 | 2.8 | 2.8 |
Technology | SNP array | Infinium BeadChip | SNP array (Gistic2) | RNA-Seq | RNA-Seq | Mass spectrometry | Mass spectrometry |
Distribution assumptions | Ordinal | Beta mixture | Negative binomial | Negative binomial | Negative binomial | log-Gaussian | log-Gaussian |
log2 ratio: Gaussian | log2 scale: Gaussian | log2 scale: Gaussian | log2 scale: Gaussian | log2 scale: Gaussian |
Patient ID 患者 ID . | SNP SNP (SNP) . | Methylation ( 甲基化 ( | Copy number variation (number of copies) 拷贝数变化(拷贝数) . | Gene expression (reads) 基因表达 (reads) . | miRNA (reads) miRNA(读取) . | Protein (intensity/concentration) 蛋白质(强度/浓度) . | Metabolite (intensity/concentration) . |
---|---|---|---|---|---|---|---|
1001 | 1 | 0.2 | 0 | 5 | 5 | 0.07 | 0.07 |
1002 | 0 | 0.11 | 5 | 2 | 2 | 0.1 | 0.1 |
1003 | 1 | 0.95 | 10 | 0 | 0 | 9.6 | 9.6 |
1004 | 2 | 0.5 | 4 | 30 | 30 | 2.8 | 2.8 |
Technology 科技 | SNP array SNP 阵列 | Infinium BeadChip | SNP array (Gistic2) SNP 阵列 (Gistic2) | RNA-Seq RNA-测序 | RNA-Seq RNA-测序 | Mass spectrometry 质谱法 | Mass spectrometry |
Distribution assumptions 分配假设 | Ordinal 序数 | Beta mixture β 混合物 | Negative binomial 负二项式 | Negative binomial 负二项式 | Negative binomial 负二项式 | log-Gaussian 对数高斯 | log-Gaussian |
log2 ratio: Gaussian 对数2 比值:高斯 | log2 scale: Gaussian 对数2 尺度:高斯 | log2 scale: Gaussian 对数2 尺度:高斯 | log2 scale: Gaussian 对数2 尺度:高斯 | log2 scale: Gaussian |
Patient ID . | SNP . | Methylation ( | Copy number variation (number of copies) . | Gene expression (reads) . | miRNA (reads) . | Protein (intensity/concentration) . | Metabolite (intensity/concentration) . |
---|---|---|---|---|---|---|---|
1001 | 1 | 0.2 | 0 | 5 | 5 | 0.07 | 0.07 |
1002 | 0 | 0.11 | 5 | 2 | 2 | 0.1 | 0.1 |
1003 | 1 | 0.95 | 10 | 0 | 0 | 9.6 | 9.6 |
1004 | 2 | 0.5 | 4 | 30 | 30 | 2.8 | 2.8 |
Technology | SNP array | Infinium BeadChip | SNP array (Gistic2) | RNA-Seq | RNA-Seq | Mass spectrometry | Mass spectrometry |
Distribution assumptions | Ordinal | Beta mixture | Negative binomial | Negative binomial | Negative binomial | log-Gaussian | log-Gaussian |
log2 ratio: Gaussian | log2 scale: Gaussian | log2 scale: Gaussian | log2 scale: Gaussian | log2 scale: Gaussian |
RNA-level omics data usually include messenger RNA (mRNA) expression and microRNA (miRNA) expression. Traditionally, DNA microarrays were used to measure the expression levels of DNA sequences called probes, which acted as a proxy for the amount of reads representing a genomic feature of interest. The reads of microarray expression level can be characterized by a negative binomial distribution directly, or a Gaussian distribution after log2 transformation (see Table 3). Nowadays, RNA sequencing (RNA-seq) has replaced DNA microarrays, since it allows for the sequencing of the whole transcriptome, while DNA microarrays only profile predefined transcripts or gene probes. miRNAs are small noncoding regulatory RNAs that play an important role in regulating gene expression and are highly evolutionary conserved in mammals (Bartel 2018).
RNA 水平组学数据通常包括信使 RNA (mRNA) 表达和 microRNA (miRNA) 表达。传统上,DNA 微阵列用于测量称为探针的 DNA 序列的表达水平,探针充当代表目标基因组特征的读数量的代理。微阵列表达水平的读数可以直接用负二项分布或 log2 转换后的高斯分布来表征(见表 3)。如今,RNA 测序 (RNA-seq) 已经取代了 DNA 微阵列,因为它允许对整个转录组进行测序,而 DNA 微阵列仅分析预定义的转录本或基因探针。miRNA 是小的非编码调节 RNA,在调节基因表达中起重要作用,并且在哺乳动物中高度进化保守 (Bartel 2018)。
Protein-level omics data usually originates from mass spectrometry (MS)-based proteomics profiling of a particular cell, tissue, organ, which can detect and measure abundance levels of entire or phosphorylated proteins. In contrast to global proteomics with MS, TCGA consortium has produced more targeted proteomics profiles, involving a set of 180–250 protein features using reverse-phase protein arrays (RPPA) (Akbani et al. 2014). In contrast, The Clinical Proteomic Tumor Analysis Consortium (CPTAC) profiling has produced >10 000 protein features using MS technology (Edwards et al. 2015). Protein expression data can often be considered approximately Gaussian distributed after log2 transformation, depending on the data-generating platform. In global proteomics, there are often sample-specific missing values due to detection limits of protein quantification.
蛋白质水平组学数据通常来源于基于质谱 (MS) 的特定细胞、组织、器官的蛋白质组学分析,可以检测和测量完整或磷酸化蛋白质的丰度水平。与使用 MS 的全球蛋白质组学相比,TCGA 联盟产生了更具针对性的蛋白质组学概况,涉及使用反相蛋白质阵列 (RPPA) 的一组 180-250 个蛋白质特征(Akbani 等人,2014 年)。相比之下,临床蛋白质组学肿瘤分析联盟 (CPTAC) 分析使用 MS 技术产生了 >10 000 个蛋白质特征(Edwards 等人,2015 年)。根据数据生成平台,在 log2 转换后,蛋白质表达数据通常可以被认为是近似高斯分布的。在全球蛋白质组学中,由于蛋白质定量的检测限,经常存在样品特异性缺失值。
Metabolite-level omics data has similar statistical properties to proteomics data, and metabolomics profiling is usually done with MS-based technologies, enabling the detection, and quantification of many thousands of metabolic features simultaneously. Nuclear magnetic resonance (NMR) spectroscopy is the other main analytical technology to profile metabolic processes. An important limitation of NMR spectroscopy is its relatively low sensitivity, which may lead to relatively few detected metabolites. Metabolic concentrations often follow logarithmic Gaussian distribution. Similar to large-scale proteomics, missing values due to detection limits should be treated differently than missing values due to measurement artefacts, which are more frequent in metabolite-level omics data (Sun and Xia 2023).
代谢物水平的组学数据具有与蛋白质组学数据相似的统计特性,代谢组学分析通常使用基于 MS 的技术完成,能够同时检测和定量数千个代谢特征。核磁共振 (NMR) 波谱是描述代谢过程的另一种主要分析技术。NMR 波谱的一个重要局限性是其灵敏度相对较低,这可能导致检测到的代谢物相对较少。代谢浓度通常服从对数高斯分布。与大规模蛋白质组学类似,应区别对待由于检测限导致的缺失值与由于测量伪影导致的缺失值,后者在代谢物水平的组学数据中更为常见(Sun 和 Xia 2023)。
Single-cell sequencing is becoming increasingly more prevalent in many profiling studies. The modern single-cell omics technologies can produce multiple levels of omics data derived from the same samples, such as transcriptomics and chromatin accessibility. The single-cell data types are similar to the illustration in Table 3, but each measurement originates from the level of individual cells. This article focuses mainly on survival analysis with single bulk omics data.
单细胞测序在许多分析研究中变得越来越普遍。现代单细胞组学技术可以产生来自相同样本的多层次组学数据,例如转录组学和染色质可及性。单单元数据类型与表 3 中的插图类似,但每个测量值都源自单个单元的级别。本文主要关注使用单个批量组学数据的生存分析。
2.4 Missing data 2.4 缺失数据
Missing values are often observed in many types of high-dimensional omics data due to various experimental reasons (Aittokallio 2009, Kong et al. 2022). For example, mRNA transcriptomics data from microarrays have 1%–10% missing values affecting up to 95% of genes due to corruption of image, scratches on the slides, poor hybridization, inadequate resolution, fabrication errors (de Brevern et al. 2004, Tuikkala et al. 2005); MS-based metabolomics data have 10%–20% missing values affecting up to 80% of variables due to lack of peak identification by chromatogram, limitation of computational detection, measurement error, and deconvolution errors (Hrydziuszko and Viant 2012, Sun and Xia 2023). The aforementioned technical reasons can lead to missing data that is either missing at random (MAR) or missing not at random (MNAR). When dealing with MNAR data, traditional imputation methods like multiple imputation may introduce bias. It is thus recommended to remove omics features which have large proportion (e.g. 50%) missingness over patients, and then apply imputation methods (e.g. k-nearest neighbor imputation) for the rest of the features with missing values before doing any statistical analysis or modeling. Alternatively, imputation-free methods (e.g. mixture models) that can deal with missing values can be applied directly (Taylor et al. 2022). Single-cell RNA-seq (scRNA-seq) data has a vast number of zeros, so-called gene dropout events, leading highly scarce data matrices. Jiang et al. (2022) discussed the sources of biological and nonbiological zeros in scRNA-seq data and the existing approaches for handling them.
由于各种实验原因,在许多类型的高维组学数据中经常观察到缺失值(Aittokallio 2009,Kong et al. 2022)。例如,由于图像损坏、载玻片上的划痕、杂交不良、分辨率不足、制造错误,来自微阵列的 mRNA 转录组学数据有 1%-10% 的缺失值,影响高达 95% 的基因(de Brevern 等人,2004 年,Tuikkala 等人,2005 年);由于无法通过色谱图鉴定峰、计算检测的限制、测量误差和反卷积误差,基于 MS 的代谢组学数据存在 10%-20% 的缺失值,影响高达 80% 的变量(Hrydziuszko 和 Viant 2012,Sun 和 Xia 2023)。上述技术原因可能导致随机缺失 (MAR) 或非随机缺失 (MNAR) 缺失数据。在处理 MNAR 数据时,传统的插补方法(如多重插补)可能会引入偏倚。因此,建议删除患者缺失比例较大(例如 50%)的组学特征,然后在进行任何统计分析或建模之前对其余缺失值的特征应用插补方法(例如 k 最近邻插补)。或者,可以直接应用可以处理缺失值的无插补方法(例如 混合模型)(Taylor 等人,2022 年)。单细胞 RNA-seq (scRNA-seq) 数据有大量的零,即所谓的基因缺失事件,导致数据矩阵高度稀缺。江 et al. (2022) 讨论了 scRNA-seq 数据中生物和非生物零的来源以及处理它们的现有方法。
3 Survival analysis with low-dimensional input data
3 使用低维输入数据进行生存分析
Let us assume we have data
假设我们有数据
Figure 2a shows the KM curve for TCGA-BRCA primary breast cancer patients data. Some basic statistics can be revealed from the survival curve. For example, the estimated median survival time, i.e. the time when the survival probability is 50%, of all the breast cancer patients is 10.8 years (dashed line in Fig. 2a), and 1-, 5-, and 10-year survival probabilities are 0.988, 0.853, and 0.658, respectively. A log-rank test (Peto and Peto 1972) can be used to test whether two groups of patients [e.g. with treatment (pharmaceutical/radiation therapy) or nontreatment] have the same (null hypothesis) or different survival functions (alternative hypothesis), and provide a corresponding P-value (see Fig. 2b). The log-rank test can also be used to compare the survival probabilities of any subgroups of patients based on other categorical variables.
图 2a 显示了 TCGA-BRCA 原发性乳腺癌患者的 KM 曲线数据。从生存曲线中可以揭示一些基本统计数据。例如,所有乳腺癌患者的估计中位生存时间,即生存概率为 50% 的时间为 10.8 年(图 2a 中的虚线),1 年、5 年和 10 年生存概率分别为 0.988、0.853 和 0.658。对数秩检验(Peto 和 Peto 1972)可用于检验两组患者 [例如接受治疗(药物治疗/放射治疗)或未接受治疗] 是否具有相同(零假设)或不同的生存函数(替代假设),并提供相应的 P 值(见图 2b)。对数秩检验还可用于根据其他分类变量比较任何患者亚组的生存概率。
在有多种临床、人口统计学或组学特征可用的情况下,可以分别探索每个变量与生存结果的关联。对于分类变量,可以使用 KM 曲线和对数秩检验来调查按变量分类的多条生存曲线之间是否存在差异。对于连续变量 X,通常使用半参数 Cox 比例风险模型(Cox 模型,Cox 1972):
其中
It is often of interest to gain insights into the multiple factors and their cooperation for the survival outcomes. Multivariable statistical analysis plays an important role in such multi-modal survival modeling. The univariate Cox model (1) can be straightforwardly generalized by including multiple clinical, demographic or omics variables of interest, as long as the total number of covariates is much smaller than the number of samples; this often requires the use of variable or feature selection methods. Heinze et al. (2018) provides pragmatic recommendations for practitioners of multivariable analysis with variable selection methods in low-dimensional modeling problems.
深入了解多种因素及其对生存结果的合作通常是有趣的。多变量统计分析在这种多模态生存建模中起着重要作用。单变量 Cox 模型 (1) 可以通过包括多个感兴趣的临床、人口统计学或组学变量来直接推广,只要协变量的总数远小于样本的数量;这通常需要使用变量或特征选择方法。Heinze et al. (2018) 为在低维建模问题中使用变量选择方法进行多变量分析的从业者提供了务实的建议。
4 Survival analysis with high-dimensional input data
4 使用高维输入数据的生存分析
Since some omics data contain high-number of variables (e.g. RNA-seq with ca. 60 000 transcriptomic features and DNA methylation with ca. 450 000 features), there is a need to reduce the computational and modeling burden in multivariable analyses. One heuristics approach to pre-select a subset of features is to include only omics features at a specific statistical significance level when fitting a univariate Cox model (1); the pre-specified significance level would usually be higher than the commonly used 0.05 threshold, e.g. 0.1 or 0.2, to avoid losing important features. However, this univariate approach focuses on features that are independently associated with the outcome and might miss variables that are important in combination with other features (Okser et al. 2013).
由于一些组学数据包含大量变量(例如,具有约 60 000 个转录组特征的 RNA-seq 和具有约 450 000 个特征的 DNA 甲基化),因此需要减少多变量分析中的计算和建模负担。一种预先选择特征子集的启发式方法是在拟合单变量 Cox 模型时仅包括特定统计显着性水平的组学特征 (1);预先设定的显著性水平通常会高于常用的 0.05 阈值,例如 0.1 或 0.2,以避免丢失重要特征。然而,这种单变量方法侧重于与结果独立相关的特征,并且可能会错过与其他特征相结合的重要变量(Okser 等人,2013 年)。
Another simple approach is to pre-select omics features by variance, since larger variability across patients usually implies higher biological information, or at least predictive signal. One can for example pre-select omics features explaining 50% of the total cumulative variance (Zhao and Zucknick 2020). Such unsupervised feature pre-selection is a recommended method to reduce dimensionality when dealing with hundreds or thousands of omics features, with the aim to improve the stability of final feature selection. For example, Bommert et al. (2022) showed that the simple variance filter was the best method among all considered filter methods in terms of the predictive accuracy, run time and the feature selection stability in their benchmark study. Zhao et al. (2024) showed that the stability of final feature subset critically depends on the pre-selected feature set when using a standard Bayesian stochastic search variable selection method (see Section 4.3), and their proposed method that used known biological relationships between omics features lead to a more stable feature selection and slightly improved outcome prediction.
另一种简单的方法是通过方差预先选择组学特征,因为患者之间更大的变异性通常意味着更高的生物学信息,或者至少是预测信号。例如,可以预先选择组学特征来解释总累积方差的 50%(Zhao 和 Zucknick 2020)。这种无监督特征预选是在处理成百上千个组学特征时降低维度的推荐方法,目的是提高最终特征选择的稳定性。例如,Bommert 等人(2022 年)表明,在他们的基准研究中,就预测准确性、运行时间和特征选择稳定性而言,简单方差滤波器是所有考虑的滤波器方法中最好的方法。Zhao et al. (2024) 表明,当使用标准贝叶斯随机搜索变量选择方法时,最终特征子集的稳定性在很大程度上取决于预先选择的特征集(参见第 4.3 节),他们提出的方法使用组学特征之间已知的生物学关系导致更稳定的特征选择和略微改善的结果预测。
When drawing conclusions on survival differences solely from the univariate Cox model (1), it is important to adjust the P-values of risk features for multiple comparisons to control false positives globally, e.g. by controlling the family-wise error rate (FWER) or false discovery rate (FDR). The problem of multiple testing is beyond the scope of this tutorial, but the interested readers are referred to a recent review article (Korthauer et al. 2019). We note also that univariate analysis does not consider any relationships between multiple omics features, e.g. potential confounders for survival outcomes (Clark et al. 2003). To avoid making seriously misleading conclusions in such cases, it is necessary to perform multivariable survival analysis (Bradburn et al. 2003).
当仅从单变量 Cox 模型 (1) 得出关于生存差异的结论时,重要的是要调整风险特征的 P 值以进行多重比较以全局控制假阳性,例如通过控制家庭错误率 (FWER) 或错误发现率 (FDR)。多重测试的问题超出了本教程的范围,但感兴趣的读者可以参考最近的一篇评论文章(Korthauer et al. 2019)。我们还注意到,单变量分析不考虑多个组学特征之间的任何关系,例如 生存结果的潜在混杂因素 (Clark et al. 2003)。为避免在这种情况下做出严重误导性的结论,有必要进行多变量生存分析(Bradburn 等人,2003 年)。
Omics data can have hundreds of thousands of variables measured at various molecular levels, which greatly challenges the classical multivariable regression models for time-to-event endpoints, since the number of variables is often much larger than the number of patients (i.e.
组学数据可以有数十万个在不同分子水平上测量的变量,这极大地挑战了用于事件发生时间终点的经典多变量回归模型,因为变量的数量通常比患者的数量大得多(即
Method 方法 . | Feature selection via 特征选择 . | Grouping effects 分组效果 . | Uncertainty quantification 不确定性量化 . | Comment 评论 . |
---|---|---|---|---|
Penalized regressions: 惩罚回归: | Penalty 罚款 | |||
Lasso Cox (Tibshirani 1997) Lasso Cox (Tibshirani 1997) | ✗ | ✗ | ||
Adaptive Lasso Cox (Zhang and Lu 2007) 适应性套索 Cox (Zhang 和 Lu 2007) | Weighted 加权 | ✗ | ✗ | Less false positives than Lasso 比 Lasso 更少的误报 |
Elastic Net Cox (Simon et al. 2011) Elastic Net Cox (Simon et al. 2011) | ✓ | ✗ | ||
Group-Lasso Cox (Kim et al. 2012) Group-Lasso Cox (Kim et al. 2012) | ✓ | ✗ | Independent groups of features selected 选定的独立功能组 | |
Sparse Group-Lasso Cox (Simon et al. 2013) 稀疏群-Lasso Cox (Simon et al. 2013) | ✓ | ✗ | ||
SCAD Cox (Fan and Li 2002) SCAD Cox (Fan 和 Li 2002) | Quadratic spline and symmetric penalty 二次样条和对称罚 | ✓ | ✗ | Selection of relatively large effects 选择相对较大的效应 |
SIS Cox (Fan et al. 2010) SIS Cox (Fan et al. 2010) | Top-ranked variables and any penalty 排名靠前的变量和任何惩罚 | ✓ | ✗ | Suited to ultra-high dimensions 适用于超高尺寸 |
Bayesian models: 贝叶斯模型: | Shrinkage prior 收缩先验 | |||
Lee et al. (2011) Lee 等人 (2011) | Lasso (Laplace) prior 套索 (拉普拉斯) prior | ✗ | ✓ | Selection of posterior mean with a cutoff 选择具有截止值的后平均值 |
Lee et al. (2015) Lee 等人 (2015) | Elastic Net, group/fused Lasso priors Elastic Net,分组/融合 Lasso 先验 | ✓ | ✓ | Selection of posterior mean with a cutoff 选择具有截止值的后平均值 |
Konrath et al. (2013) Konrath 等人 (2013) | Spike-and-slab prior 钉板先验 | ✗ | ✓ | |
Madjar et al. (2021) Madjar 等人 (2021) | Spike-and-slab and MRF priors Spike-and-slab 和 MRF 先验 | ✓ | ✓ | |
Mu et al. (2021) Mu 等人 (2021) | Horseshoe prior 马蹄形 | ✗ | ✓ | Selection of posterior mean with a cutoff 选择具有截止值的后平均值 |
Method . | Feature selection via . | Grouping effects . | Uncertainty quantification . | Comment . |
---|---|---|---|---|
Penalized regressions: | Penalty | |||
Lasso Cox (Tibshirani 1997) | ✗ | ✗ | ||
Adaptive Lasso Cox (Zhang and Lu 2007) | Weighted | ✗ | ✗ | Less false positives than Lasso |
Elastic Net Cox (Simon et al. 2011) | ✓ | ✗ | ||
Group-Lasso Cox (Kim et al. 2012) | ✓ | ✗ | Independent groups of features selected | |
Sparse Group-Lasso Cox (Simon et al. 2013) | ✓ | ✗ | ||
SCAD Cox (Fan and Li 2002) | Quadratic spline and symmetric penalty | ✓ | ✗ | Selection of relatively large effects |
SIS Cox (Fan et al. 2010) | Top-ranked variables and any penalty | ✓ | ✗ | Suited to ultra-high dimensions |
Bayesian models: | Shrinkage prior | |||
Lee et al. (2011) | Lasso (Laplace) prior | ✗ | ✓ | Selection of posterior mean with a cutoff |
Lee et al. (2015) | Elastic Net, group/fused Lasso priors | ✓ | ✓ | Selection of posterior mean with a cutoff |
Konrath et al. (2013) | Spike-and-slab prior | ✗ | ✓ | |
Madjar et al. (2021) | Spike-and-slab and MRF priors | ✓ | ✓ | |
Mu et al. (2021) | Horseshoe prior | ✗ | ✓ | Selection of posterior mean with a cutoff |
Method 方法 . | Feature selection via 特征选择 . | Grouping effects 分组效果 . | Uncertainty quantification 不确定性量化 . | Comment 评论 . |
---|---|---|---|---|
Penalized regressions: 惩罚回归: | Penalty 罚款 | |||
Lasso Cox (Tibshirani 1997) Lasso Cox (Tibshirani 1997) | ✗ | ✗ | ||
Adaptive Lasso Cox (Zhang and Lu 2007) 适应性套索 Cox (Zhang 和 Lu 2007) | Weighted 加权 | ✗ | ✗ | Less false positives than Lasso 比 Lasso 更少的误报 |
Elastic Net Cox (Simon et al. 2011) Elastic Net Cox (Simon et al. 2011) | ✓ | ✗ | ||
Group-Lasso Cox (Kim et al. 2012) Group-Lasso Cox (Kim et al. 2012) | ✓ | ✗ | Independent groups of features selected 选定的独立功能组 | |
Sparse Group-Lasso Cox (Simon et al. 2013) 稀疏群-Lasso Cox (Simon et al. 2013) | ✓ | ✗ | ||
SCAD Cox (Fan and Li 2002) SCAD Cox (Fan 和 Li 2002) | Quadratic spline and symmetric penalty 二次样条和对称罚 | ✓ | ✗ | Selection of relatively large effects 选择相对较大的效应 |
SIS Cox (Fan et al. 2010) SIS Cox (Fan et al. 2010) | Top-ranked variables and any penalty 排名靠前的变量和任何惩罚 | ✓ | ✗ | Suited to ultra-high dimensions 适用于超高尺寸 |
Bayesian models: 贝叶斯模型: | Shrinkage prior 收缩先验 | |||
Lee et al. (2011) Lee 等人 (2011) | Lasso (Laplace) prior 套索 (拉普拉斯) prior | ✗ | ✓ | Selection of posterior mean with a cutoff 选择具有截止值的后平均值 |
Lee et al. (2015) Lee 等人 (2015) | Elastic Net, group/fused Lasso priors Elastic Net,分组/融合 Lasso 先验 | ✓ | ✓ | Selection of posterior mean with a cutoff 选择具有截止值的后平均值 |
Konrath et al. (2013) Konrath 等人 (2013) | Spike-and-slab prior 钉板先验 | ✗ | ✓ | |
Madjar et al. (2021) Madjar 等人 (2021) | Spike-and-slab and MRF priors Spike-and-slab 和 MRF 先验 | ✓ | ✓ | |
Mu et al. (2021) Mu 等人 (2021) | Horseshoe prior 马蹄形 | ✗ | ✓ | Selection of posterior mean with a cutoff 选择具有截止值的后平均值 |
Method . | Feature selection via . | Grouping effects . | Uncertainty quantification . | Comment . |
---|---|---|---|---|
Penalized regressions: | Penalty | |||
Lasso Cox (Tibshirani 1997) | ✗ | ✗ | ||
Adaptive Lasso Cox (Zhang and Lu 2007) | Weighted | ✗ | ✗ | Less false positives than Lasso |
Elastic Net Cox (Simon et al. 2011) | ✓ | ✗ | ||
Group-Lasso Cox (Kim et al. 2012) | ✓ | ✗ | Independent groups of features selected | |
Sparse Group-Lasso Cox (Simon et al. 2013) | ✓ | ✗ | ||
SCAD Cox (Fan and Li 2002) | Quadratic spline and symmetric penalty | ✓ | ✗ | Selection of relatively large effects |
SIS Cox (Fan et al. 2010) | Top-ranked variables and any penalty | ✓ | ✗ | Suited to ultra-high dimensions |
Bayesian models: | Shrinkage prior | |||
Lee et al. (2011) | Lasso (Laplace) prior | ✗ | ✓ | Selection of posterior mean with a cutoff |
Lee et al. (2015) | Elastic Net, group/fused Lasso priors | ✓ | ✓ | Selection of posterior mean with a cutoff |
Konrath et al. (2013) | Spike-and-slab prior | ✗ | ✓ | |
Madjar et al. (2021) | Spike-and-slab and MRF priors | ✓ | ✓ | |
Mu et al. (2021) | Horseshoe prior | ✗ | ✓ | Selection of posterior mean with a cutoff |
4.1 Unsupervised learning
4.1 无监督学习
Unsupervised learning methods aim to identify hidden patterns or data groupings, and are for example useful when a phenotype (e.g. a cancer type) is to be divided into several subtypes (e.g. to explain heterogeneity among patients). For example, breast cancer has been traditionally categorized into five conceptual molecular classes, originally using pairwise average-linkage cluster analysis of DNA microarray data, to better understand tumor biology and guide clinical decision making (Perou et al. 2000). Unsupervised methods learn underlying patterns from unlabeled data by transforming high-dimensional omics features into a lower dimensional space. Principal component analysis (PCA) is a classical multivariate technique that represents high-dimensional features in a low-dimensional space by building orthogonal (uncorrelated) linear combinations of the features that best capture the variance in the high-dimensional data.
无监督学习方法旨在识别隐藏的模式或数据分组,例如,当表型(例如癌症类型)要分为几个亚型时(例如,解释患者之间的异质性)时,这种方法非常有用。例如,乳腺癌传统上被分为五个概念分子类别,最初使用 DNA 微阵列数据的成对平均连锁聚类分析,以更好地了解肿瘤生物学并指导临床决策(Perou 等人,2000 年)。无监督方法通过将高维组学特征转换为低维空间,从未标记的数据中学习基本模式。主成分分析 (PCA) 是一种经典的多元技术,它通过构建最能捕获高维数据中方差的特征的正交(不相关)线性组合来表示低维空间中的高维特征。
Different from the distance-based PCA with linear transformation, nonlinear techniques have recently emerged, such as t-stochastic neighbor embedding (t-SNE). t-SNE uses pairwise similarities of individuals based on Kullback–Leibler divergence to project similarities into a lower dimensional space, ensuring that individuals with similar omics features are close in the generated embedding (van der Maaten and Hinton 2008). The focus of t-SNE is on preserving local distances between neighboring data points. Another nonlinear alternative to PCA is UMAP (Uniform Manifold Approximation and Projection), which is a general dimension reduction method built upon Riemannian geometry (McInnes et al. 2018). Compared to other nonlinear methods of dimension reduction such as t-SNE, UMAP can sometimes provide better visualization quality in a shorter amount of time, while preserving the global structure of the omics data better.
与线性变换的基于距离的 PCA 不同,最近出现了非线性技术,例如 t 随机邻域嵌入 (t-SNE)。t-SNE 使用基于 Kullback-Leibler 散度的个体成对相似性,将相似性投影到较低维空间中,确保具有相似组学特征的个体在生成的嵌入中接近(van der Maaten 和 Hinton 2008)。t-SNE 的重点是保持相邻数据点之间的本地距离。PCA 的另一种非线性替代方案是 UMAP(均匀流形近似和投影),这是一种建立在黎曼几何基础上的通用降维方法(McInnes 等人,2018 年)。与其他非线性降维方法(如 t-SNE)相比,UMAP 有时可以在更短的时间内提供更好的可视化质量,同时更好地保留组学数据的全局结构。
Unsupervised methods only make use of the input data matrix
无监督方法仅使用输入数据矩阵
4.2 Supervised learning via penalized regressions
4.2 通过惩罚回归进行监督学习
出于个性化癌症医学的目的,人们通常对从临床和组学特征中识别风险因素组合感兴趣。这些因素可以通过治疗策略(直接或间接)靶向或用于诊断。因此,目标是通过利用大量可用的组学数据等大量信息来识别一组与生存结果相关的简洁特征。惩罚 Lasso Cox 回归 (Tibshirani 1997) 可用于选择一些相关的组学特征,方法是将它们的系数估计为非零(不相关特征的系数缩小到零),通过最大化具有
这里,
其中
There are many alternative penalties that will achieve feature selection which have been applied to Cox proportional hazards regression for survival outcomes, such as the Adaptive Lasso Cox model that incorporates different penalties for different coefficients to retain important variables (Zhang and Lu 2007), Group-Lasso that performs group selection on (predefined) groups of variables (Kim et al. 2012), Sparse Group-Lasso that introduces sparse effects both on a group and within group level (Simon et al. 2013), the smoothly clipped absolute deviation (SCAD) Cox model that overcomes substantially biased estimates for large coefficients in ultra-sparse models (Fan and Li 2002), sure independence screening (SIS) procedure in combination with Cox model that speeds-up the feature selection dramatically and can also improve the accuracy of estimation when dimensionality becomes ultra-high, i.e.
有许多替代惩罚可以实现特征选择,这些惩罚已应用于生存结果的 Cox 比例风险回归,例如自适应套索 Cox 模型,该模型对不同系数进行了不同的惩罚以保留重要变量(Zhang 和 Lu 2007),对(预定义)变量组执行组选择的 Group-Lasso(Kim 等人,2012 年)、在组级别和组内引入稀疏效应的稀疏组-套索(Simon et al. 2013),平滑裁剪绝对偏差 (SCAD) Cox 模型克服了超稀疏模型中大系数的实质性偏差估计(Fan 和 Li 2002),确定独立性筛选 (SIS) 程序与 Cox 模型相结合,大大加快了特征选择的速度,并且还可以提高维数变得超高,即
4.3 Supervised learning via Bayesian priors
4.3 通过贝叶斯先验进行监督学习
Bayesian inference is an appealing approach for survival analysis due to its ability to provide straightforward uncertainty quantification (e.g. credible intervals) of parameters and survival probabilities. For instance, Lee et al. (2011) proposed a Bayesian version of the Lasso Cox (Bayesian Lasso Cox) model that provides credible intervals of coefficients fairly straightforward (see Fig. 4), but it is not easy to derive confidence intervals of coefficients in a Lasso-type model.
贝叶斯推理是一种有吸引力的生存分析方法,因为它能够提供参数和生存概率的直接不确定性量化(例如可信的区间)。例如,Lee 等人(2011 年)提出了套索考克斯(Bayesian Lasso Cox)模型的贝叶斯版本,该模型提供了相当简单的可信系数区间(见图 4),但在套索型模型中推导出系数的置信区间并不容易。
贝叶斯方法的基本定理是贝叶斯规则。设
其中
如果我们分配独立的双指数(也称为拉普拉斯,图 5a)先验,
其中
Note that the Bayesian version of the Lasso with Laplace-type priors in practice do not result in automatic feature selection, because only the posterior modes of the coefficients are equivalent to the frequentist Lasso solution, while in Bayesian inference one usually focuses on the posterior means as point estimates. As an alternative, a particular omics feature can be excluded if the estimated credible interval of the corresponding coefficient covers zero.
请注意,在实践中,具有拉普拉斯类型先验的 Lasso 的贝叶斯版本不会导致自动特征选择,因为只有系数的后验模式等价于频率拉索解,而在贝叶斯推理中,人们通常将后验均值作为点估计。作为替代方案,如果相应系数的估计可信区间覆盖零,则可以排除特定的组学特征。
随机搜索变量选择 (SSVS) 是识别重要协变量的另一种方法(George 和 McCulloch 1993,Konrath 等人,2013 年)。SSVS 使用独立的 spike-and-slab 先验作为回归系数,例如
其中
其中
另一种流行的收缩先验是马蹄形先验,这是一种连续的全局-局部收缩先验,其中全局参数允许在组学特征之间共享信息,而局部参数允许在单个组学特征水平进行调整(Carvalho 等人,2009 年,Mu 等人,2021 年)。在与 (5) 中具有尖峰和平板先验的 Cox 模型类似的设置中,回归系数的马蹄形先验为
其中,局部参数
Although Bayesian models can quantify uncertainty of estimators more straightforward than penalized regressions, most Bayesian Cox-type models for high-dimensional covariates do not have user-friendly and standalone R packages on CRAN or GitHub. The main reason is the high computational cost of running a high-dimensional Bayesian Cox model. Advanced users with programming capabilities can contact the corresponding authors for original scripts. Since Bayesian priors are more flexible than frequentist Lasso-type penalties, it can be easier to extend Bayesian models by changing shrinkage priors while keeping almost the same algorithm framework. This means that the Bayesian framework can be more suitable if one is interested in tailoring the shrinkage effects, e.g. to include prior knowledge about the importance of omics features, e.g. features corresponding to a molecular pathway that is known to be affected in the disease under study. For more information on different Bayesian priors in cancer prognosis, we suggest a recent review by Chu et al. (2022) which summarized other different shrinkage priors on regression coefficients, such as Gaussian-gamma, Gaussian, Cauchy, pMOM (product moment distribution), piMOM (product inverse moment distribution) and peNMIG (parameter-expanded normal-mixture-of-inverse-gamma distribution) priors.
尽管贝叶斯模型可以比惩罚回归更直接地量化估计量的不确定性,但大多数用于高维协变量的贝叶斯 Cox 型模型在 CRAN 或 GitHub 上没有用户友好的独立 R 包。主要原因是运行高维 Bayesian Cox 模型的高计算成本。具有编程能力的高级用户可以联系相应的作者获取原始脚本。由于贝叶斯先验比频率 Lasso 类型的惩罚更灵活,因此在保持几乎相同的算法框架的同时,通过更改收缩先验来扩展贝叶斯模型会更容易。这意味着如果有人对定制收缩效应感兴趣,例如包括关于组学特征重要性的先验知识,例如对应于已知在所研究疾病中受影响的分子途径的特征,贝叶斯框架可能更合适。有关癌症预后中不同贝叶斯先验的更多信息,我们建议 Chu 等人(2022 年)最近的一篇综述,该综述总结了回归系数的其他不同收缩先验,例如高斯伽马、高斯、柯西、pMOM(乘积矩分布)、piMOM(乘积逆矩分布)和 peNMIG(参数扩展的正态混合逆伽马分布)先验。
5 Survival model validation
5 生存模型验证
Model validation plays an important role in identifying potential issues, such as model misspecification or overfitting. This is achieved by revisiting the model’s specifications and assumptions following model estimation. For example, the Cox model (1) requires proportional hazards and the logarithm of the hazard to be linear with respect to the model covariates. The former assumption can for example be checked by the cumulative Schoenfeld residuals (Grambsch and Therneau 1994), and the latter assumption can be checked by plotting a nonlinear functional form (e.g. spline) for the effect of a covariate. If the Cox model assumptions are not satisfied, one can try certain transformations of covariates (e.g. Box-Cox power transformations, Box and Cox 1964), allow time-varying coefficients or model interactions among covariates (Ng et al. 2023), or investigate patient stratification using unsupervised approaches (Cristescu et al. 2015). However, the assumption checks are usually suitable only for low-dimensional models, i.e. for a few clinical variables or a few factors projected from the high-dimensional omics feature space. Novel approaches for assumption checks in general high-dimensional settings require further methodological developments. Johnson and Long (2011) used heuristic methods to investigate the Cox model assumptions by separately fitting univariate Cox models one feature at one time, and check P-values for the score tests of individual features and P-values for testing the proportional hazards assumption of univariate Cox models. But the univariate Cox models do not take into account confounding variables. An alternative approach is to loosen the model assumptions for a more robust modeling approach. One example developed specifically for feature selection under possibly nonproportional hazards in a high-dimensional space is concordance regression (Dunkler et al. 2010).
模型验证在识别潜在问题(例如模型错误指定或过度拟合)方面起着重要作用。这是通过在模型估计后重新审视模型的规范和假设来实现的。例如,Cox 模型 (1) 要求比例风险和风险的对数相对于模型协变量呈线性。例如,前一个假设可以通过累积 Schoenfeld 残差 (Grambsch and Therneau 1994) 来检查,后一个假设可以通过绘制协变量效应的非线性函数形式(例如样条)来检查。如果不满足 Cox 模型假设,可以尝试协变量的某些变换(例如 Box-Cox 幂变换、Box 和 Cox 1964),允许协变量之间的时变系数或模型交互(Ng 等人,2023 年),或使用无监督方法研究患者分层(Cristescu 等人,2015 年).然而,假设检查通常仅适用于低维模型,即从高维组学特征空间投射的少数临床变量或少数因素。在一般高维设置中进行假设检查的新方法需要进一步的方法论发展。Johnson 和 Long (2011) 使用启发式方法通过一次单独拟合单变量 Cox 模型一个特征来研究 Cox 模型假设,并检查单个特征的分数测试的 P 值和用于测试单变量 Cox 模型的比例风险假设的 P 值。但是单变量 Cox 模型没有考虑混杂变量。 另一种方法是放宽模型假设,以获得更稳健的建模方法。专门为高维空间中可能的非比例风险下的特征选择开发的一个例子是一致性回归(Dunkler 等人,2010 年)。
5.1 Feature stability analysis
5.1 特征稳定性分析
One important aspect in model validation when using omics or other high-dimensional data is the potential instability of feature selection (Kalousis et al. 2007). Feature selection using penalized regressions as described in Section 4.2 heavily depends on the values of the penalty parameters [e.g. for the
使用组学或其他高维数据时,模型验证的一个重要方面是特征选择的潜在不稳定性(Kalousis 等人,2007 年)。如第 4.2 节所述,使用惩罚回归的特征选择在很大程度上取决于惩罚参数的值 [例如,对于 Lasso Cox 模型 (2) 中的参数]。
For the Bayesian models in Section 4.3, feature selection stability is naturally assessed by the uncertainty of coefficients’ estimators, as reflected in the posterior variances of
对于第 4.3 节中的贝叶斯模型,特征选择稳定性自然是通过系数估计量的不确定性来评估的,这反映在
5.2 Survival prediction and calibration
5.2 生存预测和校准
The fundamental goal of any statistical prediction model is to achieve a better prediction performance than an existing statistical model which we could call the ‘conventional model’ (Gerds and Kattan 2021). The special nature of the combination of clinical and/or other known prognostic factors (typically low-dimensional, with established effect) and novel omics features (high-dimensional, with unknown effect) should especially be taken into account. Therefore, a survival model consisting of only established clinical and/or other known prognostic factors should serve as the benchmark (i.e. conventional model) for the upcoming modeling. The inclusion of new covariates (e.g. omics features) into a prognostic model only makes sense if the new covariates have added prognostic value over the established clinical prognostic factors (De Bin et al. 2014), i.e. the new prognostic model consisting of the new covariates plus benchmark covariates improves the prediction performance over the conventional model.
任何统计预测模型的基本目标都是实现比现有统计模型更好的预测性能,我们可以称之为 “传统模型”(Gerds 和 Kattan 2021)。应特别考虑临床和/或其他已知预后因素(通常是低维的,具有既定效果)和新的组学特征(高维,效果未知)组合的特殊性质。因此,仅由已建立的临床和/或其他已知预后因素组成的生存模型应作为即将进行的建模的基准(即 常规模型)。只有当新的协变量在已建立的临床预后因素上增加了预后价值时,将新的协变量(例如组学特征)纳入预后模型才有意义(De Bin 等人,2014 年),即由新协变量和基准协变量组成的新预后模型提高了预测性能优于传统模型。
To confirm that the identified clinical and omics features have prognostic power with respect to the prediction of patients’ survival outcomes, a model should be both accurate (low prediction error) and precise (low prediction uncertainty). The simplest way to demonstrate the prognostic power is to dichotomize the prognostic scores [i.e. linear predictor
为了确认已确定的临床和组学特征在预测患者生存结果方面具有预后能力,模型应同时准确(低预测误差)和精确(低预测不确定性)。证明预后能力的最简单方法是将预后评分 [即 Cox 模型中的线性预测因子
To validate a prediction model systematically (Rahman et al. 2017, Royston and Altman 2013), the predictive performance of the model is commonly addressed by
为了系统地验证预测模型(Rahman 等人,2017 年,Royston 和 Altman 2013 年),模型的预测性能通常由
discrimination: the ability of the model to distinguish between low and high risk patients,
鉴别力:模型区分低风险和高风险患者的能力,calibration: the agreement between the observed outcomes and predicted survival probabilities, and
校准:观察到的结果与预测的生存概率之间的一致性,以及overall performance: the distance between the observed and predicted survival probabilities.
总体性能:观测到的生存概率和预测的生存概率之间的距离。
5.2.1 Discrimination performance
5.2.1 鉴别表现
如果专注于固定时间点的生存预测(例如 5 年生存概率),则可以使用受试者工作特征 (ROC) 曲线来评估生存模型的预后(即预测或鉴别生存)能力,通常用其 ROC 曲线下面积 (AUC) 来总结(Heagerty 等人,2000 年),见图 7a例如。AUC 为 0.5 相当于抛硬币,AUC 越接近 1,模型的预测性就越强。当在多个时间点进行预测时,ROC 曲线可以概括为时间依赖性的 AUC 分数,即在预先指定的时间点计算的 AUC 分数。或者,一致性指数 (C-index) 对预后模型的鉴别能力提供了更全面的、时间独立的评估,因此更好的模型对生存时间较短的患者预测了更高的预后评分(Harrell 等人,1982 年,Antolini 等人,2005 年),即
这意味着在没有删失的情况下
In the classical Cox modeling framework, both Harrell’s and Uno’s C-indices only depend on the linear predictors
在经典的 Cox 建模框架中,Harrell 和 Uno 的 C 指数都只依赖于线性预测变量
5.2.2 Calibration performance
5.2.2 校准性能
校准是为了量化观察到的结果和预测的结果之间的一致性,这对于内部和外部模型验证都很有用,建议定期报告。通常使用校准斜率 (van Houwelingen 2000),它是观测/实际生存概率对模型预测生存概率的回归斜率。生存模型可以用预定义的子组中估计的 t 年生存概率来报告,表示为
其中
As an alternative to the calibration slope at a single time point, Andres et al. (2018) and Haider et al. (2020) suggested the distributional calibration (D-Calibration) for accounting survival probabilities across all time points. This can be useful when assessing the entire post-treatment survival prediction, e.g. assessing post liver transplantation survival utility in Andres et al. (2018).
作为单个时间点校准斜率的替代方案,Andres 等人 (2018) 和 Haider 等人 (2020) 提出了分布校准 (D-Calibration) 来计算所有时间点的生存概率。这在评估整个治疗后生存预测时很有用,例如 Andres 等人 (2018) 评估肝移植后生存效用。
5.2.3 Overall performance
5.2.3 整体性能
Scoring rules can evaluate the accuracy and confidence of probabilistic predictions, and assess both discrimination and calibration (Gneiting and Raftery 2007, Avati et al. 2020). The idea of scoring rules dates back to Brier (1950) which assigned a numerical score for verifying ensemble-based probabilistic predictions of discrete outcomes.
评分规则可以评估概率预测的准确性和置信度,并评估判别和校准(Gneiting 和 Raftery 2007,Avati 等人,2020 年)。评分规则的想法可以追溯到 Brier (1950),他分配了一个数字分数来验证基于集成的离散结果的概率预测。
Graf 等人 (1999) 提出了时间依赖性 Brier 评分,它是不同时间点生存概率预测的预期均方误差,即
其中
To overcome potential overfitting when using feature selection and model estimation, the survival predictions and model calibration should be evaluated in independent validation datasets. As independent validation data are seldom available, we can split the available data into training and validation datasets. However, any single split of the data hides partial data from all steps of model building, which might introduce bias, it is recommended to use resampling-based methods for assessing the survival model’s performance. In addition, using resampling-based methods also allow us to estimate the uncertainty of the performance estimator (Sill et al. 2014; Gerds and Kattan 2021, chapter 7). This can be done for example by repeatedly splitting the data to training and validation sets, and evaluating a survival model’s performance on the different validation sets using various discrimination or calibration indices. The 0.632+ bootstrap estimator for a discrimination or calibration index can balance the apparent (training) error and the average out-of-bag bootstrap error, and in addition accounts for the relative overfitting based on a no-information error rate in high-dimensional settings (Schumacher et al. 2007, Binder and Schumacher 2008a). This is a typical machine learning approach with two levels of resampling. The outer layer of resampling is to evaluate the prediction performance and the inner layer of resampling (usually CV) is to optimize model’s tuning parameters. We note that the preparatory steps, such as multi-modal data standardization and feature pre-selection in the context of high-dimensional input data, may affect the survival prediction performance, and thus should be included in the resampling steps for model validation.
为了克服使用特征选择和模型估计时可能出现的过拟合,应在独立的验证数据集中评估生存预测和模型校准。由于很少有独立的验证数据,我们可以将可用数据拆分为训练数据集和验证数据集。但是,数据的任何一次拆分都会隐藏模型构建所有步骤的部分数据,这可能会引入偏差,建议使用基于重采样的方法来评估生存模型的性能。此外,使用基于重采样的方法还允许我们估计性能估算器的不确定性(Sill 等人,2014 年;Gerds 和 Kattan 2021 年,第 7 章)。例如,可以通过将数据重复拆分为训练集和验证集,并使用各种判别或校准指数评估生存模型在不同验证集上的性能来实现。判别或校准指数的 0.632+ bootstrap 估计器可以平衡表观(训练)误差和平均袋外 bootstrap 误差,此外还可以解释基于高维设置中无信息误差率的相对过拟合(Schumacher et al. 2007, Binder and Schumacher 2008a).这是一种典型的机器学习方法,具有两个级别的重新采样。重采样的外层是评估预测性能,重采样的内层(通常是 CV)是优化模型的调优参数。 我们注意到,在高维输入数据背景下的多模态数据标准化和特征预选等准备步骤可能会影响生存预测性能,因此应包含在模型验证的重采样步骤中。
5.2.4 Graphical representation
5.2.4 图形表示
After confirming that a model is valid (assumptions hold), accurate (low prediction error), precise (uncertainty of performance measures properly quantified) and its predictions generalizable beyond the training dataset (using independent validation data if available), a prognostic nomogram (Kattan et al. 1999) can be used to summarize the prognostic effect of the identified clinical and omics features on the risk of a specific year’s overall survival (Fig. 6), which may help the clinicians to enhance the patient management and personalized treatment strategies. For example, the red colored dots in Fig. 6 show the information of the identified five variables from an example patient and the corresponding scoring points. The summed scoring points of 263 maps to the predicted 1-year, 3-year and 5-year survival probabilities of this patient. Note that most nomograms treat the identified variables independently in the risk calculation, even though there may be significant interactions among the model features that were used in the feature selection step. However, visualizing such interaction effects would make the nomograms less accessible and interpretable, and so, there is still a room for improvement in how to translate the multivariate risk scores into clinical practice.
在确认模型有效(假设成立)、准确(低预测误差)、精确(正确量化性能测量的不确定性)及其预测可在训练数据集之外推广(使用独立的验证数据,如果可用)后,可以使用预后列线图(Kattan 等人,1999 年)来总结已确定的临床和组学特征对特定年份总生存期风险的预后影响(图 6),这可能有助于临床医生加强患者管理和个性化治疗策略。例如,图 6 中的红色点显示了从示例患者中识别出的 5 个变量的信息以及相应的评分点。263 的总得分点映射到该患者预测的 1 年、3 年和 5 年生存概率。请注意,大多数列线图在风险计算中独立处理已识别的变量,即使在特征选择步骤中使用的模型特征之间可能存在显著的交互。然而,可视化这种交互效应会使列线图更难获得和解释,因此,在如何将多变量风险评分转化为临床实践方面仍有改进的空间。
When an independent validation dataset is available, it is recommended to report a calibration plot corresponding to the nomogram. Using independent validation data to obtain
当有独立的验证数据集可用时,建议报告与列线图对应的校准图。使用独立验证数据得到
6 Beyond penalized and Bayesian Cox models
6 超越罚函数模型和贝叶斯 Cox 模型
In this tutorial, we mainly focused on penalized regressions and Bayesian hierarchical models in the Cox proportional hazards framework. One can extend this framework in several ways. For instance, one can stay in a likelihood-based modeling framework, but replace the partial likelihood function of the semi-parametric Cox model by alternative likelihood functions (which do not necessarily need to imply proportional hazards), e.g. parametric survival models like exponential, Weibull, or accelerated failure time (AFT) models, or Aalen’s additive hazard model (Gorst-Rasmussen and Scheike 2012). Alternatively, one can move to a more algorithmic machine learning approach, such as tree-based boosting or bagging methods, e.g. random survival forests (Hothorn et al. 2006, Binder and Schumacher 2008b, Jaeger et al. 2019), or (deep) neural networks (Wiegrebe et al. 2024).
在本教程中,我们主要关注 Cox 比例风险框架中的惩罚回归和贝叶斯分层模型。可以通过多种方式扩展此框架。例如,可以停留在基于似然的建模框架中,但用替代似然函数(不一定需要暗示比例风险)替换半参数 Cox 模型的偏似然函数,例如指数、Weibull 或加速失效时间 (AFT) 模型等参数生存模型,或 Aalen 的加性风险模型(Gorst-Rasmussen 和 Scheike 2012).或者,人们可以转向更具算法性的机器学习方法,例如基于树的提升或装袋方法,例如随机生存森林(Hothorn 等人,2006 年,Binder 和 Schumacher 2008b,Jaeger 等人,2019 年)或(深度)神经网络(Wiegrebe 等人,2024 年)。
Hothorn et al. (2006) introduced ensemble tree methods for analyzing right-censored survival data, which construct ensembles from base learners, e.g. binary survival trees for each omics feature. Hothorn et al. (2006) also proposed a gradient boosting algorithm to predict the survival time of patients with acute myeloid leukemia (AML), based on clinical and omics features. Similarly, Binder and Schumacher (2008b) developed a likelihood-based boosting method, which aims to maximize the Cox partial likelihood function, for modeling time-to-event data based on high-dimensional omics input data and which also allows the inclusion of a small number of mandatory covariates. In general, one needs to be cautious if using some machine learning methods that are not well-suited for high-dimensional features. For example, Kvamme et al. (2019) proposed extensions of the Cox model with neural networks, which are only valid if the number of covariates is smaller than the number of samples, i.e. if
Hothorn et al. (2006) 引入了用于分析右删失生存数据的集合树方法,该方法从基本学习器构建集合,例如每个组学特征的二元生存树。Hothorn 等人 (2006) 还提出了一种梯度提升算法,以根据临床和组学特征预测急性髓性白血病 (AML) 患者的生存时间。同样,Binder 和 Schumacher (2008b) 开发了一种基于似然的提升方法,旨在最大化 Cox 偏似然函数,用于基于高维组学输入数据对事件发生时间数据进行建模,并且还允许包含少量的强制性协变量。一般来说,如果使用一些不太适合高维特征的机器学习方法,则需要谨慎。例如,Kvamme 等人(2019 年)提出了使用神经网络对 Cox 模型进行扩展,只有当协变量的数量小于样本的数量时,即如果
In the case of nonproportional hazards, many likelihood-based survival models beyond Cox-type models have also been extended to account for high-dimensional omics as input data. For example, Ma et al. (2006) combined Lin and Yin’s additive hazard model (Lin and Ying 1994) with principal component regression for dimension reduction of omics features, which was applied to the study of gene expression-based survival prediction for diffuse large B-cell lymphoma. Engler and Li (2009) added the Elastic Net penalty in an accelerated ATF model, which assumes that the effect of a covariate accelerates or decelerates the life course of patients. Schmid and Hothorn (2008) and Barnwal et al. (2022) used boosting algorithms to learn parametric AFT models. Ha et al. (2014) considered the Lasso, SCAD and a penalized h-likelihood for feature selection in frailty models which assume that individuals have unobserved heterogeneity captured by a latent random term Z, which adapts the Cox model (1) into
在非比例风险的情况下,除了 Cox 型模型之外,许多基于可能性的生存模型也被扩展为将高维组学作为输入数据。例如,马等人(2006)将Lin和Yin的加性危害模型(Lin和Ying 1994)与组学特征降维的主成分回归相结合,该模型应用于弥漫性大B细胞淋巴瘤的基于基因表达的生存预测研究。Engler 和 Li (2009) 在加速 ATF 模型中添加了弹性网惩罚,该模型假设协变量的影响会加速或减慢患者的生命进程。Schmid 和 Hothorn (2008) 以及 Barnwal 等人 (2022) 使用提升算法来学习参数化 AFT 模型。Ha 等人(2014 年)考虑了虚弱模型中特征选择的套索、SCAD 和惩罚 h 似然,这些模型假设个体具有由潜在随机项 Z 捕获的未观察到的异质性,该模型将 Cox 模型 (1) 改编为
6.1 Advanced survival models: cure models, competing risks and multi-state models
6.1 高级生存模型:治愈模型、竞争风险和多状态模型
In some situations, survival data may be different from Fig. 1c (also Section 2.1), where it was presumed that all individuals will eventually experience the event of interest. Liu et al. (2012) studied the Lasso and SCAD feature selection for the proportional hazard mixture cure model, in which a certain fraction of individuals will never experience the event of interest. Tapak et al. (2015) investigated Lasso, Elastic Net and likelihood-based boosting for microarray-based survival modeling with competing risks, such as ‘progression’ versus ‘death from noncancer cause’, i.e. the event of a patient can occur due to one of multiple distinct causes. There is a growing awareness of the impact of competing risks when developing prognostic models with high-dimensional input data, e.g. Binder et al. (2009), Ambrogi and Scheike (2016) and Fu et al. (2017). For a single individual who can experience several possible events, Dutta et al. (2017) proposed a multi-state model to identify risk factors in different stages of disease based on high-dimensional input data.
在某些情况下,生存数据可能与图 1c(也是第 2.1 节)不同,其中假设所有个体最终都会经历感兴趣的事件。Liu et al. (2012) 研究了比例风险混合治愈模型的 Lasso 和 SCAD 特征选择,其中一部分人永远不会经历感兴趣的事件。Tapak 等人(2015 年)研究了套索、弹性网和基于可能性的提升,用于基于微阵列的生存建模,具有竞争风险,例如“进展”与“非癌症原因死亡”,即患者的事件可能是由于多种不同原因之一而发生的。在开发具有高维输入数据的预后模型时,人们越来越意识到竞争风险的影响,例如 Binder 等人(2009 年)、Ambrogi 和 Scheike (2016 年)以及 Fu 等人(2017 年)。对于可能经历多种可能事件的单个个体,Dutta 等人(2017 年)提出了一种多状态模型,以基于高维输入数据识别疾病不同阶段的风险因素。
7 Toward single-cell data analysis
7 迈向单细胞数据分析
The cellular heterogeneity of complex sample mixtures pose challenges and also opportunities for precision medicine and survival prediction. For example, Zhou et al. (2019) showed that tumor microenvironment-related gene expression signatures do not only accurately predict the survival among colon cancer patients, but also serve as biomarkers for identifying patients who could potentially benefit from adjuvant chemotherapy. Single-cell technologies provide an unprecedented opportunity for dissecting the interplay between the cancer cells and the associated tumor microenvironment, and the produced high-dimensional data should also augment existing survival modeling approaches. The emerging single-cell atlases are providing a detailed and quantitative overview of tissue composition and organization, and will advance both biomedical research and clinical practice (Elmentaite et al. 2022).
复杂样品混合物的细胞异质性为精准医学和生存预测带来了挑战和机遇。例如,周 et al. (2019) 表明,肿瘤微环境相关的基因表达特征不仅可以准确预测结肠癌患者的生存率,还可以作为识别可能从辅助化疗中受益的患者的生物标志物。单细胞技术为剖析癌细胞与相关肿瘤微环境之间的相互作用提供了前所未有的机会,生成的高维数据也应增强现有的生存建模方法。新兴的单细胞图谱提供了组织组成和组织的详细和定量概述,并将推进生物医学研究和临床实践(Elmentaite 等人,2022 年)。
Currently, the focus of statistical model development for single-cell data analysis is to understand cell type composition and its impact on gene regulation and transcriptional dynamics, usually based on only a small number of samples/individuals. On the one hand, the underlying statistical models for single-cell data analysis are still in development and continuously being re-evaluated and challenged (Kharchenko 2021). On the other hand, survival analysis tackles disease and health contexts at the individual level, and usually requires a relatively large number of individuals for sufficient statistical power. In particular, the development of improved computational methods is urgently needed to enable the consideration of multiple layers (e.g. individual-level, cellular-level, molecular-level) of information when integrating groups of individuals and omics data from a variety of molecular modalities. Therefore, current approaches often map the findings from single-cell omics data to large-scale bulk sequencing omics and survival data, rather than jointly analyzing single-cell omics and survival data. For example, Guo et al. (2018) introduced a generalizable approach to first study T-cells in 14 nonsmall cell lung cancer (NSCLC) patients and to identify gene signatures from the tumor-enriched T cell clusters, and then investigated these gene signatures using bulk RNA-seq and survival data from larger TCGA-NSCLC patient cohort. Similarly, Zhang et al. (2020) developed a scRNA-seq-based approach to reconstruct a multilayer signaling network based on 16 glioma patients, and then investigated the network genes using survival data from TCGA Chinese glioma genome atlas (TCGA-CGGA) patients. However, direct joint analysis of survival and single-cell omics data from multiple cellular hierarchies requires further methodological developments and new statistical and machine learning methods.
目前,单细胞数据分析统计模型开发的重点是了解细胞类型组成及其对基因调控和转录动力学的影响,通常仅基于少量样本/个体。一方面,单细胞数据分析的基础统计模型仍在开发中,并不断受到重新评估和挑战 (Kharchenko 2021)。另一方面,生存分析在个体层面处理疾病和健康背景,通常需要相对大量的个体才能获得足够的统计能力。特别是,迫切需要开发改进的计算方法,以便在整合来自各种分子模式的个体群体和组学数据时能够考虑多层(例如个体水平、细胞水平、分子水平)的信息。因此,目前的方法往往将单细胞组学数据的结果映射到大规模批量测序组学和生存数据,而不是联合分析单细胞组学和生存数据。例如,Guo 等人 (2018) 引入了一种通用方法,首先研究 14 名非小细胞肺癌 (NSCLC) 患者的 T 细胞,并从富含肿瘤的 T 细胞簇中识别基因特征,然后使用大量 RNA-seq 和来自更大 TCGA-NSCLC 患者队列的生存数据来研究这些基因特征。同样,Zhang et al. (2020) 开发了一种基于 scRNA-seq 的方法,基于 16 名神经胶质瘤患者重建多层信号网络,然后使用来自 TCGA 中国神经胶质瘤基因组图谱 (TCGA-CGGA) 患者的生存数据研究网络基因。 然而,对来自多个细胞层次结构的存活和单细胞组学数据进行直接联合分析需要进一步的方法学发展以及新的统计和机器学习方法。
8 Discussion 8 讨论
Although survival analysis faces many modeling challenges, mainly due to censored outcomes, it represents a well-established methodology for finding risk factors associated with patients’ survival. The identification of omics biomarkers for survival prognosis may provide systematic means to guide patient management and personalized treatment and diagnostic strategies. In this tutorial, we provided a comprehensive workflow for survival analysis with high-dimensional omics and standard clinical data, with a specific focus on feature selection of survival-associated omics features and survival model validation. We covered many penalized regressions and Bayesian models for feature selection and survival prediction, accounting for their specific assumptions and applications. Examples of real data and R scripts have been made available to illustrate the use of different methods, which should help researchers to choose and apply suitable methods for their survival analysis applications (https://ocbe-uio.github.io/survomics). We note that this review only considers methods for right-censored time-to-event data, i.e. where all individuals are assumed to be followed continuously from the start of the study, but where the follow-up period might end before the event (e.g. death) was observed. Other types of censoring include interval censoring and left truncation, and appropriate statistical methods dealing with these censoring patterns should be chosen accordingly.
尽管生存分析面临许多建模挑战,主要是由于删失结果,但它代表了一种发现与患者生存相关的风险因素的成熟方法。确定组学生物标志物的生存预后可能提供系统性的手段来指导患者管理以及个性化治疗和诊断策略。在本教程中,我们提供了一个全面的生存分析工作流程,包括高维组学和标准临床数据,特别关注生存相关组学特征的特征选择和生存模型验证。我们介绍了许多用于特征选择和生存预测的惩罚回归和贝叶斯模型,并考虑了它们的具体假设和应用。已经提供了真实数据和 R 脚本的示例来说明不同方法的使用,这应该有助于研究人员为他们的生存分析应用程序 (https://ocbe-uio.github.io/survomics) 选择和应用合适的方法。我们注意到,本综述仅考虑了右删失事件发生时间数据的方法,即假设所有个体从研究开始就被连续跟踪,但随访期可能在观察到事件(例如死亡)之前结束。其他类型的删失包括区间删失和左截断,应相应地选择处理这些删失模式的适当统计方法。
Most of the current methods for survival analysis do not explicitly take into account the complex structures within and between multi-omics data, such as gene regulation and DNA-protein interactions. Regulatory networks constructed either based on prior biological knowledge or using data-driven, yet biologically explainable approaches, may help establish useful methodologies for survival analysis that are more effective for deriving biological insights as well as enable improved clinical translation. However, to achieve a comprehensive and biologically meaningful integration of high-dimensional multi-omics data, there is a need for continued development of computational and statistical approaches that consider both technical and biological intricacies of the data and technologies, respectively (Wissel et al. 2023). This is currently a very active research field, and we expect to see many improved multi-omics methods for survival prediction in the future.
目前的大多数生存分析方法都没有明确考虑多组学数据内部和之间的复杂结构,例如基因调控和 DNA-蛋白质相互作用。基于先前的生物学知识或使用数据驱动但生物学上可解释的方法构建的监管网络可能有助于建立有用的生存分析方法,这些方法更有效地获得生物学见解并能够改进临床转化。然而,为了实现高维多组学数据的全面和生物学意义的整合,需要继续开发计算和统计方法,分别考虑数据和技术的技术和生物学复杂性(Wissel 等人,2023 年)。这是目前一个非常活跃的研究领域,我们预计未来会看到许多改进的多组学生存预测方法。
Another limitation of most of the reviewed methods is that they identify omics features prognostic of survival, but they cannot determine causal relationships. Causality is a fundamental notion to understand omics features causing disease progression, which will allow one to reliably intervene omics features for targeted therapies. There are two popular causal inference models, Pearl’s structural causal model (SCM) and Rubin’s causal model (RCM), both of which introduce perturbations to draw causal inference. Farooq et al. (2023) utilized SCM-based causal discovery approaches to unravel relationships between omics features and survival of breast cancer patients. However, to identify reliable causal relations for clinical applications, laboratory-based experiments, e.g. clustered regularly interspaced short palindromic repeats (CRISPR) techniques (Wang and Doudna 2023), are often necessary to verify the functional relevance of the identified omics features. High-dimensional RCM-based mediation analysis has been used to investigate the indirect effect transmitted by omics features between an exposure and survival outcomes (Song et al. 2020, 2021). Causal mediation analysis is an important tool, which considers the problem of decomposing the causal effect of treatment/exposure into direct and indirect effects (Lange and Hansen 2011, VanderWeele 2011). The direct effect corresponds to the effect of a treatment directly on the survival outcome, while an indirect effect corresponds to the effect of a treatment on the outcome that is due to its effect on an intermediate variable (e.g. gene expression) that also has a causal effect on the survival outcome. Targeted learning also fills a much needed gap between statistical modeling and causal inference (van der Laan and Rose 2011, 2018). Tuglus and van der Laan (2011) used targeted maximum likelihood estimation to provide an interpretable causal measure of variable importance for the discovery of biomarkers and omics features. Another way to formalize personalized medicine is dynamic treatment regimes (Chakraborty and Moodie 2013, Tsiatis et al. 2019, Deliu et al. 2023) that encompasses causal inference and takes into account for the variability in omics, environment and lifestyle factors for each individual to improve the treatment of a particular patient. However, all the causal machine learning methods require further methodological developments for adaptation to survival modeling with high-dimensional input data.
大多数综述方法的另一个局限性是它们识别了生存预后的组学特征,但它们无法确定因果关系。因果关系是理解导致疾病进展的组学特征的基本概念,这将使人们能够可靠地干预组学特征以进行靶向治疗。有两种流行的因果推理模型,Pearl 的结构因果模型 (SCM) 和 Rubin 的因果模型 (RCM),这两种模型都引入了扰动来得出因果推理。Farooq 等人(2023 年)利用基于 SCM 的因果发现方法来揭示组学特征与乳腺癌患者生存之间的关系。然而,为了确定临床应用的可靠因果关系,通常需要基于实验室的实验,例如成簇的规则间隔短回文重复序列 (CRISPR) 技术 (Wang 和 Doudna 2023),以验证已确定的组学特征的功能相关性。基于高维 RCM 的中介分析已被用于研究组学特征在暴露和生存结果之间传递的间接影响(Song 等人,2020 年,2021 年)。因果中介分析是一个重要的工具,它考虑了将治疗/暴露的因果效应分解为直接和间接效应的问题(Lange 和 Hansen 2011,VanderWeele 2011)。直接效应对应于治疗对生存结果的直接影响,而间接效应对应于治疗对结果的影响,这是由于其对中间变量(例如 基因表达)的影响,该变量也对生存结果有因果影响。 有针对性的学习还填补了统计建模和因果推理之间急需的空白(van der Laan 和 Rose 2011,2018)。 Tuglus 和 van der Laan (2011) 使用目标最大似然估计为生物标志物和组学特征的发现提供了可变重要性的可解释因果测量。另一种将个性化医疗正规化的方法是动态治疗方案(Chakraborty 和 Moodie 2013,Tsiatis 等人,2019 年,Deliu 等人,2023 年),它包含因果推断,并考虑到每个人的组学、环境和生活方式因素的可变性,以改善对特定患者的治疗。然而,所有因果机器学习方法都需要进一步的方法论发展,以适应使用高维输入数据的生存建模。
Acknowledgements 确认
The authors thank Professor Ørnulf Borgan for his valuable suggestions and comments.
作者感谢 Ørnulf Borgan 教授的宝贵建议和评论。
Conflict of interest 利益冲突
None declared. 没有人宣布。
Funding 资金
This work was supported by grants from Helse Sør-Øst [2020026 to T.A.], the Norwegian Cancer Society [216104 to T.A.], the Radium Hospital Foundation, the Academy of Finland [326238, 340141, 344698, 345803 to T.A.], Cancer Society of Finland [to T.A.], the European Union’s Horizon 2020 research and innovation programme [‘PANCAIM’ 101016851 to J.Z. and T.A., ‘RESCUER’ 847912 to M.Z.], the Innovative Medicines Initiative 2 Joint Undertaking of the European Union’s Horizon 2020 research and innovation programme and EFPIA and JDRF INTERNATIONAL [‘imSAVAR’ 853988 to M.Z.], and ERA PerMed under the ERA-NET Cofund scheme of the European Union’s Horizon 2020 research and innovation framework programme [‘SYMMETRY’ 342752 to M.Z.].
这项工作得到了 Helse Sør-Øst [2020026 to TA]、挪威癌症协会 [216104 to TA]、镭医院基金会、芬兰科学院 [326238、340141、344698、345803 to TA]、芬兰癌症协会 [to TA]、欧盟地平线 2020 研究和创新计划 ['PANCAIM' 101016851 to J.Z. 和 T.A., “RESCUEER”847912 M.Z.]、欧盟 Horizon 2020 研究和创新计划的创新药物倡议 2 联合项目以及 EFPIA 和 JDRF INTERNATIONAL ['imSAVAR' 853988 M.Z.],以及欧盟 Horizon 2020 研究和创新框架计划的 ERA-NET 共同基金计划下的 ERA PerMed ['SYMMETRY' 342752 M.Z.]。
Data availability 数据可用性
Supplementary step-by-step R tutorial is available online at https://ocbe-uio.github.io/survomics. TCGA data are publicly available at https://portal.gdc.cancer.gov.
补充的分步 R 教程可在 https://ocbe-uio.github.io/survomics 在线获取。TCGA 数据在 https://portal.gdc.cancer.gov 上公开提供。