Construction and analysis of a lysosome-dependent cell death score-based prediction model for non-small cell lung cancer
Background Non-small cell lung cancer (NSCLC) is the most common type of lung cancer, and lung cancer is the leading cause of cancer-related deaths worldwide. Although treatment strategies such as immune checkpoint inhibitors and chemotherapy have advanced, heterogeneity among NSCLC patients leads to substantial variability in treatment outcomes. Studies have shown that some patients respond poorly to immune checkpoint inhibitors, indicating that treatment response depends on multiple factors. Predictive models that stratify patients by gene expression and clinical characteristics are therefore needed to support precision therapy.
Objective This study aimed to construct a stratified prognostic model for NSCLC patients based on lysosome-dependent cell death (LDCD) scoring by integrating single-cell RNA sequencing (scRNA-seq) and bulk RNA sequencing data. By analyzing the immune-related characteristics of the high-risk and low-risk groups, we further explored the impact of cell death patterns on lung cancer and identified potential therapeutic targets.
Methods Single-cell RNA sequencing data and gene expression data of NSCLC patients and normal lung tissues were obtained from the GEO and TCGA databases. We used R packages such as Seurat and CellChat for data preprocessing and analysis, and performed dimensionality reduction and visualization with Principal Component Analysis (PCA) and the UMAP algorithm. LASSO regression analysis was used to construct the predictive model, followed by cross-validation and ROC curve analysis. The model's effectiveness was validated through survival analysis and immune microenvironment analysis.
Results The proportion of monocytes was significantly increased in NSCLC tissues, suggesting that they play an important role in cancer progression. Cell communication analysis indicated that macrophages, smooth muscle cells, and myeloid cells exhibit strong intercellular communication during cancer progression. Using the constructed prognostic model based on 12 LDCD-related genes, we found significant differences in overall survival and immune microenvironment between the high-risk and low-risk groups.
Keywords Non-small cell lung cancer • Lysosome-dependent cell death • Single-cell
1 Introduction
Lung cancer is the most common tumor globally and the leading cause of cancer-related deaths. Non-small cell lung cancer (NSCLC) accounts for 85% of all lung cancers. According to WHO guidelines, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) are the most common subtypes [1, 2]. Many factors contribute to the progression of lung cancer, including age, gender, living environment, and smoking status. With ongoing research into NSCLC, treatment strategies have evolved to include immune checkpoint-based immunotherapy and chemotherapy [3]. However, the heterogeneity of NSCLC patients significantly affects treatment outcomes, and studies show that some patients respond only minimally to immune checkpoint inhibitors [4]. This suggests that treatment response is closely linked to various factors and that not all NSCLC patients benefit from current treatment strategies. Therefore, it is necessary to develop predictive models that stratify patients according to gene expression and clinical characteristics. Patient stratification makes it possible to identify responses to different treatment strategies and to select appropriate treatments for different patient groups, in line with the principles of precision therapy and rational drug use.
Lysosome-dependent cell death is a distinct mode of cell death with great significance for cellular life activities. Lysosomes, as cellular recycling centers, are filled with hydrolytic enzymes that can degrade most cellular macromolecules. Lysosomal membrane permeabilization and the consequent leakage of lysosomal contents into the cytosol lead to so-called “lysosome-dependent cell death”. This form of cell death is mainly carried out by lysosomal cathepsin proteases and can have necrotic, apoptotic, or apoptosis-like features depending on the extent of leakage and the cellular context [5]. Many studies have demonstrated that lysosome-dependent cell death plays an important role in tumor therapy [6], and that tumorigenesis can be inhibited by inducing lysosome-dependent cell death in tumors.
Single-cell RNA sequencing (scRNA-seq) reveals the highly complex cellular composition of the tumor microenvironment (TME) at high resolution [7]. This technique can uncover developmental changes and cell interaction information within tumors with great precision, providing new insights into tumor bioinformatics. It is also a powerful tool for exploring the common characteristics and key differences among the immune cell subsets in the TME [8]. Meanwhile, machine learning, an important branch of artificial intelligence (AI), focuses on enabling computer systems to learn from data and make predictions or decisions. By developing and applying algorithms, machine learning allows computers to recognize patterns and regularities in data and thus improve their performance without explicit programming. In the biomedical field, researchers use machine learning to analyze clinical data and develop diagnostic and prognostic models for diseases [9]. By combining the tumor microenvironment features revealed by scRNA-seq with machine learning, we can build stable prognostic models based on the clinical characteristics of NSCLC patients and explore the role of lysosome-dependent cell death in lung carcinogenesis.
In summary, the aim of this study was to construct a prognostic model for stratifying NSCLC patients based on lysosome-dependent cell death scores by integrating scRNA-seq data, bulk RNA sequencing (bulk RNA-seq) data, and clinical features. Specifically, we analyzed the scRNA-seq and bulk RNA-seq data separately and performed detailed comparisons of the high-risk and low-risk groups, including their immune-related characteristics. This approach enabled us to gain deeper insights into the impact of different cell death modes on lung cancer and to identify potential therapeutic targets.
Through this multi-level data integration and analysis, we were able not only to predict the prognosis of NSCLC patients more accurately but also to examine the mechanisms of lysosome-dependent cell death in lung cancer progression. This provides an important basis for developing personalized treatment plans and helps to discover new and effective therapeutic targets, thereby improving treatment outcomes and quality of life for NSCLC patients. In conclusion, this study offers a new perspective on the prognostic evaluation of NSCLC and provides theoretical and practical foundations for exploring the impact of cell death modes on cancer development.
2 Methods and materials
2.1 Data collection and preprocessing
Single-cell RNA sequencing data for NSCLC tissue (GSM5938737, GSM5938738) and normal lung tissue (GSM5938739, GSM5938740) were obtained from the GSE198099 dataset in the GEO database (https://www.ncbi.nlm.nih.gov/). In addition, gene expression profiles of 585 and 272 NSCLC patients were obtained from the TCGA database (https://portal.gdc.cancer.gov/) and the GSE30219 dataset in the GEO database, respectively. Their clinical characteristics, such as survival status, survival time, and TNM stage, are shown in Supplementary Tables 1 and 2. The combined data were batch corrected using the “limma” (PMC4402510) and “sva” R packages, with the “ComBat” function provided by “sva”. TCGA was used as the training set and GSE30219 as the test set.
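To make the idea of batch correction concrete, the following minimal sketch removes a per-cohort mean shift for a single gene. It is written in Python for illustration only and is not the ComBat procedure used in the study: ComBat additionally adjusts batch-specific variance and shrinks its estimates with an empirical Bayes step. The function name `center_by_batch` and the expression values are hypothetical.

```python
from statistics import mean


def center_by_batch(values, batches):
    """Shift each batch so its mean matches the grand mean (location-only adjustment).

    Simplified stand-in for batch correction, NOT the ComBat algorithm.
    """
    grand_mean = mean(values)
    # Mean expression of this gene within each batch (here, each cohort).
    batch_means = {
        b: mean(v for v, vb in zip(values, batches) if vb == b)
        for b in set(batches)
    }
    return [v - batch_means[b] + grand_mean for v, b in zip(values, batches)]


# Hypothetical expression values of one gene in the two cohorts.
expression = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
cohort = ["TCGA", "TCGA", "TCGA", "GSE30219", "GSE30219", "GSE30219"]
adjusted = center_by_batch(expression, cohort)
```

After the adjustment both cohorts share the same mean, while the differences between samples within each cohort are preserved.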
2.2 Processing of scRNA-seq data
Single-cell RNA sequencing data were read from 10X files, and a “Seurat” object was created. Low-quality cells were filtered out using the following criteria: a minimum of 200 genes, a maximum of 4000 genes, and a mitochondrial gene proportion of 20%. Highly variable genes were selected using the “FindVariableFeatures()” function and plotted using the “VariableFeaturePlot()” function, with the top 10 genes labeled on the plot for further analysis. The data were scaled and centered using the “ScaleData()” function. Principal component analysis (PCA) was performed on the highly variable genes to reduce dimensionality, and the results were visualized using the “tSNE” and “UMAP” algorithms. The “createCellChat” function was used to create a “CellChat” object for cell communication analysis; overexpressed genes and ligand-receptor pairs were identified, and ligands and receptors were mapped onto the protein-protein interaction network.
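The quality-control thresholds described above can be sketched as a simple filter. The example is in Python for illustration rather than the Seurat code used in the study. It assumes that the 20% figure is an upper bound on the mitochondrial fraction and that the gene-count bounds are inclusive, neither of which the text states explicitly. The function names and barcodes are hypothetical.

```python
def passes_qc(n_genes, mito_fraction, min_genes=200, max_genes=4000, max_mito=0.20):
    """Return True if a cell meets the gene-count and mitochondrial thresholds.

    Assumption: bounds are inclusive and max_mito is an upper limit.
    """
    return min_genes <= n_genes <= max_genes and mito_fraction <= max_mito


def filter_cells(cells):
    """Keep only cells passing QC.

    cells maps a barcode to (number of detected genes, mitochondrial fraction).
    """
    return {barcode: stats for barcode, stats in cells.items() if passes_qc(*stats)}


# Hypothetical cells: too few genes, acceptable, too many genes, high mitochondrial content.
example_cells = {
    "cell_low_genes": (150, 0.05),
    "cell_ok": (1500, 0.10),
    "cell_high_genes": (5000, 0.05),
    "cell_high_mito": (2000, 0.35),
}
kept = filter_cells(example_cells)
```

Only `cell_ok` survives this filter; the other three cells each violate one of the thresholds.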
2.3 Identification of monocytes and communication analysis
Single-cell RNA sequencing (scRNA-seq) data were analyzed using the Seurat package. t-SNE was first used for dimensionality reduction and to visualize cells according to tissue type. Monocytes were isolated and reclustered after scaling and principal component analysis (PCA). Key myeloid marker genes were identified using dot plots, and cell types were annotated accordingly to obtain monocyte subsets. Monocytes from non-small cell lung cancer (NSCLC) tissue were integrated into a monocyte dataset for further analysis. Intercellular communication was analyzed using the CellChat package, focusing on secreted signaling interactions. We identified overexpressed genes and interactions, calculated communication probabilities, and visualized them as network diagrams. We analyzed and visualized centrality indices and signaling roles using heat maps, highlighting outgoing and incoming signaling patterns.
2.4 Analysis of monocyte subpopulations
Software packages such as “reshape2”, “ggplot2” and “dplyr” were used to organize and visualize the data, compute cell type statistics, and generate bar charts. Subsequently, we used the Seurat software package for dimensionality reduction, clustering, and identification of monocyte subsets. Batch effects were corrected using the “Harmony” algorithm, and the UMAP algorithm was used for dimensionality reduction and visualization. For cell communication analysis, we constructed cell communication networks using the “CellChat” package and identified and analyzed ligand-receptor pairs. In addition, we performed visual analysis of the topology and signaling pathways of the cell communication network.
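The cell type statistics behind such bar charts amount to computing, for each tissue group, the share of cells assigned to each type. A minimal Python sketch follows (the study itself used R packages); the function name and the toy labels are hypothetical and do not reflect the study's data.

```python
from collections import Counter


def cell_type_proportions(labels):
    """Return the fraction of cells assigned to each cell type."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


# Toy annotations for two tissue groups (made-up labels, not study data).
toy_labels = {
    "tumor": ["Monocyte", "Monocyte", "Monocyte", "T cell"],
    "normal": ["Monocyte", "T cell", "T cell", "T cell"],
}
proportions = {
    group: cell_type_proportions(labels) for group, labels in toy_labels.items()
}
```

Each inner dictionary sums to 1 and can be passed directly to a plotting routine to draw one stacked bar per tissue group.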
2.5 Modularization and network analysis of monocyte scRNA-seq data using the “hdWGCNA” method
We preprocessed and cleaned the raw data using the “hdWGCNA” and “Seurat” packages in R. We then retained genes expressed in at least 5% of cells, constructed “metacells”, and normalized the “metacell” expression matrix. Next, we tested a range of soft-threshold values to select an appropriate soft-thresholding power and constructed a co-expression network. From this network, we generated a dendrogram and obtained the topological overlap matrix (TOM) for subsequent analysis. We also calculated the module eigengenes and performed inter-modular
Jiangping Fu, Yaohua Chen, Jie Li and Ming Tan have contributed equally to this work.