Abstract 摘要
Cells respond to environmental stimuli through transcriptional responses, orchestrated by transcription factors (TFs) that interpret the gene cis-regulatory DNA sequences, determining gene expression dynamics timing and locations. Diversification in TFs and cis-regulatory element (CRE) interactions result in unique gene regulatory networks (GRNs) that underpin plant adaptation. A primary challenge is identifying Transcription Factor Binding Motifs (TFBMs) for temporal and condition-specific gene expressions in plants. While the Multiple EM for Motif Elicitation (MEME) suite identifies stress-responsive CREs in Arabidopsis, its predictive power for gene expression remains uncertain. Alternatively, the k-mer approach identifies CRE sites and consensus TF motifs, thereby improving gene expression prediction models. In this study, we harnessed the power of a k-mer pipeline to address sequence-to-expression prediction problems across diverse abiotic stresses, in both bryophytic and vascular plants, including monocots and dicots. Moreover, we characterized both un-gapped and gapped CREs and, coupled with GRN analyses, pinpointed key TFs within transcriptional cascades. Lastly, we developed the Predictive Regulatory Element Database for Identifying Cis-regulatory elements and Transcription factors (PREDICT), a web tool for efficient k-mer identification. This advancement will enrich our understanding of the cis-regulatory code landscape that shapes gene regulation in plant adaptation. PREDICT web tool is available at [http://predict.southerngenomics.org/kmers/kmers.php].
细胞通过转录反应对环境刺激做出反应,转录因子(TFs)负责解释基因顺式调控 DNA 序列,决定基因表达动态的时间和位置。转录因子的多样性和顺式调控元件(CRE)的相互作用形成了独特的基因调控网络(GRN),是植物适应性的基础。一个主要挑战是识别转录因子结合基元(TFBM),以确定植物基因表达的时间和条件特异性。虽然多 EM 动机激发(MEME)套件能识别拟南芥中的胁迫响应 CREs,但其对基因表达的预测能力仍不确定。另外,k-mer 方法可以识别 CRE 位点和共识 TF 基序,从而改进基因表达预测模型。在本研究中,我们利用 k-mer 管道的强大功能,解决了包括单子叶植物和双子叶植物在内的木本植物和维管束植物在各种非生物胁迫下的序列到表达预测问题。此外,我们还鉴定了无间隙和有间隙的 CRE,并结合 GRN 分析,确定了转录级联中的关键 TF。最后,我们开发了用于识别顺式调控 元件和转录因子的预测性 调控元件数据库 (PREDICT),这是一种用于高效识别 k-mer 的网络工具。 这一进展将丰富我们对形成植物适应性基因调控的顺式调控代码景观的理解。PREDICT 网络工具见 [http://predict.southerngenomics.org/kmers/kmers.php]。
Introduction 导言
The fluctuating environments lead to the phenotypic diversities of crops, aiding their adaptation to various conditions and contributing to economic improvement. Previous studies have shown that transcriptional regulatory mechanisms play a crucial role in regulating gene expressions, which in turn contributes to phenotypic diversity across species (1). The transcription regulatory mechanism comprises at least two components: transcription factors (TFs) and cis-regulatory sites. A TF typically recognizes multiple, slightly varied cis-regulatory sites collectively referred to as a cis-regulatory element (CRE), representing the binding specificity of TFs (2). Consequently, variations in gene expression may arise from differences in the cis-regulatory sites on genes and/or the TF binding specificities to genes. Such variation serves as the driving force for the expansion of gene regulatory networks (GRNs) throughout the evolution of land plants for climate adaptation (3, 4). Therefore, developing a method to identify Transcription Factor Binding Motifs (TFBMs) for building the GRNs and elucidating transcriptional regulatory cascades becomes a significant question. The Multiple EM for Motif Elicitation (MEME) suite is widely used for de novo motif discovery, requiring no prior information on TFBMs, and has successfully identified several known stress-responsive CREs among stress-regulated genes in Arabidopsis (5, 6). This suggests the application of the MEME suite in identifying TF binding sites and subsequently summarizing them into the consensus TF motifs in plants. However, although the MEME suite is a popular approach for elucidating several TFBMs, the extent to which this pipeline can efficiently uncover the sets of TFBMs that well predict and explain plant gene expression regulation remains unclear (1).
多变的环境导致作物表型的多样性,帮助作物适应各种条件,并促进经济效益的提高。以往的研究表明,转录调控机制在调控基因表达方面起着至关重要的作用,而基因表达反过来又促成了物种间的表型多样性 (1)。转录调控机制至少由两部分组成:转录因子(TF)和顺式调控位点。转录因子通常识别多个略有不同的顺式调节位点,这些位点统称为顺式调节元件(CRE),代表了转录因子的结合特异性 (2)。因此,基因表达的变化可能源于基因上顺式调节位点的不同和/或 TF 与基因结合特异性的不同。在陆地植物适应气候的进化过程中,这种变化是基因调控网络(GRN)扩展的驱动力 (3,4)。因此,开发一种方法来识别转录因子结合基元(TFBMs)以构建基因调控网络并阐明转录调控级联成为一个重要问题。多 EM 动机激发(MEME)套件被广泛用于从头发现动机,不需要事先了解 TFBMs 的信息,它已经成功地在拟南芥的胁迫调控基因中发现了几个已知的胁迫响应 CREs(5,6)。 这表明 MEME 套件可用于识别 TF 结合位点,并随后将其归纳为植物中的共识 TF 基序。然而,尽管 MEME 套件是阐明多个 TFBM 的常用方法,但这一管道能在多大程度上有效地发现能很好地预测和解释植物基因表达调控的 TFBM 集合仍不清楚 (1)。
We previously introduced a k-mer pipeline to efficiently and simply identify putative cis-regulatory elements (pCREs) from stress-response genes in tomatoes, Arabidopsis, and Marchantia (1, 3). These k-mer-identified pCREs were well predictive of gene expressions and some of them are similar to known TFBMs, suggesting they are authentic cis-regulatory elements. Here we expanded k-mer analyses to multiple stresses across bryophytic, monocot, and dicot plants and further assessed its application in identifying both un-gapped and gapped TFBMs. Coupled with GRN analyses, we further explore the pCREs likely bound by hub TFs to uncover transcriptional regulatory cascades. Compared to the MEME suite, the k-mer approach not only achieves this goal of identifying TFBMs, but also allows us to further map motif sites onto genes, predict gene expressions, and identify key TFs in GRNs. Together, our study developed a k-mer pipeline suited for gene expression prediction and built a web tool (PREDICT) for the easy and rapid identification of k-mer and the likely bound TFs for the general public.
我们之前引入了一个 k-mer 管道,用于从番茄、拟南芥和马钱科植物的应激反应基因中高效、简单地识别推定顺式调控元件(pCREs)(1,3)。这些经 k-mer 鉴定的 pCRE 对基因表达有很好的预测作用,其中一些与已知的 TFBM 相似,表明它们是真实的顺式调控元件。在这里,我们将 k-mer 分析扩展到了多种胁迫下的红叶植物、单子叶植物和双子叶植物,并进一步评估了它在鉴定未间隙和间隙 TFBMs 方面的应用。结合 GRN 分析,我们进一步探索了可能与枢纽 TF 结合的 pCRE,从而发现了转录调控级联。与 MEME 套件相比,k-mer 方法不仅能实现识别 TFBM 的目标,还能进一步将图案位点映射到基因上、预测基因表达并识别 GRN 中的关键 TF。总之,我们的研究开发了适合基因表达预测的 k-mer 管道,并建立了一个网络工具(PREDICT),方便公众快速识别 k-mer 和可能结合的 TFs。
Result and Discussion 结果与讨论
k-mer pipeline effectively searches for motifs to predict gene expression in response to stress conditions
k-mer 管道可有效搜索主题,预测基因在应激条件下的表达情况
In our study of the evolution of transcriptional mechanisms and their association with differential stress tolerance across plant species, we re-analyzed publicly available RNA-seq datasets from different plant species subjected to various types of abiotic stress (1, 3, 4, 7–10) (Fig. S1A & S1B, Supplemental file 1). Subsequently, we clustered these stress-regulated genes based on their distinct expression patterns. These datasets included responses to wounding, salt, and high temperature stresses in two phylogenetically distant species, Marchantia polymorpha, and Arabidopsis thaliana, as well as a crop, Solanum lycopersicum. We utilized a k-mer (an oligomer with a length of k ≥ 5 base-pairs) pipeline to identify significantly enriched sequences among genes in different clusters (Fig. 1A. See Methods). We then compared the motifs identified from the k-mer pipeline with those identified using the MEME pipeline, a classic and well-recognized approach for discovering cis-elements (Fig. 1A. See Methods). We observed that the k-mer pipeline tends to identify a larger number of sequence motifs compared to the MEME pipeline (∼30-100 from k-mers vs. ∼10 from MEME, respectively, Fig. S1C, Supplemental file 1).
在研究转录机制的进化及其与植物物种间不同胁迫耐受性的关联时,我们重新分析了来自不同植物物种的公开 RNA-seq 数据集,这些物种受到了各种类型的非生物胁迫 (1、3、4、7-10)(图 S1A 和 S1B,补充文件 1)。随后,我们根据这些应激调控基因的不同表达模式对其进行了聚类。这些数据集包括两个在系统发育上相距甚远的物种-- 马钱子和拟南芥 ,以及一种作物-- 番茄 。我们利用 k-mer(长度 k≥ 5 碱基对的寡聚体)管道在不同簇的基因中识别出显著富集的序列 (图 1A,见方法)。然后,我们将 k-mer 管道识别出的主题与 MEME 管道识别出的主题进行了比较 (图 1A,见方法),MEME 管道是发现顺式元件的经典且广受认可的方法。我们观察到,与 MEME 管道相比,k-mer 管道往往能识别出更多的序列主题(k-mer 与 MEME 分别识别出 30-100 个主题与 10 个主题,图 S1C,补充文件 1)。
在预测应激反应基因表达方面,k-mer 管道优于 MEME 管道。
A. The k-mer and MEME pipeline workflows for motif identification and machine learning (ML) prediction. B. Heatmaps show ML performance (AUROC scores) of stress response prediction using k-mer or MEME pipelines for gene clusters with distinct regulatory patterns upon wounding in tomato, salt stress in Arabidopsis, and Marchantia, and heat stress in Marchantia, Arabidopsis, and tomato. The thresholds of 10-4 and 10-6 were applied to infer a motif site of MEME-derived pCREs on genes (See Material and Methods). The AUROC scores are the mean values of 10-fold cross-validation. The full prediction results are shown in Supplemental file S1.
A.k-mer 和 MEME 管道工作流程,用于主题识别和机器学习(ML)预测。B. 热图显示了使用 k-mer 或 MEME 管道对番茄受伤、拟南芥和拟南芥盐胁迫以及拟南芥、拟南芥和番茄热胁迫时具有不同调控模式的基因簇进行胁迫响应预测的 ML 性能(AUROC 分数)。应用 10-4 和 10-6 的阈值来推断基因上 MEME 衍生 pCRE 的主题位点(见材料和方法)。AUROC 分数是 10 倍交叉验证的平均值。完整的预测结果见补充文件 S1。
To assess the relationship between the stress response and the short DNA sequences identified through the k-mer and MEME approaches, respectively, we evaluated how effectively these enriched short sequences could explain and predict the stress response in each gene cluster. We utilized machine-learning (ML) algorithms (See Methods) to predict stress-responsive genes based on the identified short sequences. In each cluster, the stress-response prediction model utilizing k-mer-derived short sequences outperformed the model that used MEME-derived sequences. This was indicated by AUROC scores ranging from 0.6-0.9 from the k-mer pipeline and 0.5-0.6 for the MEME pipeline with few exceptions with the salt stress in Marchantia G4 (Fig. 1B, Supplemental Fig. S1C, Supplemental file S2). These results showed that the short DNA sequences identified through the k-mer approaches were likely the putative cis-regulatory elements (pCREs). Furthermore, the k-mer pipeline is more efficient in identifying pCREs for stress-responsive genes across various stress types and plant species, suggesting the efficiency and generality of the k-mer pipeline in detecting cis-elements. While the MEME suite has been widely used and successful in identifying CREs for tissue- and/or stress-responsive gene expressions in Arabidopsis (6, 11), we observed its subpar performance in predicting the stress response of genes in this study. One possible reason could be the loss of sequence information of motifs generated by the MEME suite. For each motif, MEME converts all possible motif sites into position weight matrices (PWM), which can result in a loss of specific sequence information. Consequently, when mapping motifs back to the genome, this could lead to high false positive rates in identifying cis-element sites, resulting in decreased prediction performance.
为了评估应激反应与分别通过 k-mer 和 MEME 方法鉴定出的 DNA 短序列之间的关系,我们评估了这些富集的短序列在解释和预测每个基因簇的应激反应方面的有效性。我们利用机器学习(ML)算法(见方法),根据识别出的短序列预测应激反应基因。在每个基因簇中,利用 k-mer-衍生短序列的应激反应预测模型优于利用 MEME-衍生序列的模型。k-mer 管道的 AUROC 得分为 0.6-0.9,而 MEME 管道的 AUROC 得分为 0.5-0.6。 这些结果表明,通过 k-mer 方法鉴定出的短 DNA 序列很可能是假定的顺式调控元件(pCRE)。此外,k-mer 管道能更有效地识别不同胁迫类型和植物物种的胁迫反应基因的 pCRE,这表明 k-mer 管道在检测顺式元件方面具有高效性和通用性。虽然 MEME 套件在鉴定拟南芥组织和/或胁迫响应基因表达的 CREs 方面已被广泛使用并取得了成功 (6,11),但在本研究中,我们发现它在预测基因的胁迫响应方面表现不佳。其中一个可能的原因可能是 MEME 套件生成的母题丢失了序列信息。 对于每个图案,MEME 会将所有可能的图案位点转换为位置权重矩阵(PWM),这会导致特定序列信息的丢失。因此,在将图案映射回基因组时,这可能会导致在识别顺式元素位点时出现较高的假阳性率,从而降低预测性能。
k-mer pipeline identifies the cis-regulatory elements near TSSs
k-mer 管道识别 TSS 附近的顺式调控元件
Next, we investigated whether the pCREs identified in stress-responsive clusters are predominantly located around transcriptional start sites (TSSs), a regulatory region known for TF binding (12). We examined the site distribution of pCREs identified from the k-mer and MEME pipelines in the region extending from 1 kb upstream to 0.5 kb downstream of TSSs. The site distributions from the k-mer pipeline were primarily located near TSSs, while those from the MEME pipeline were dispersed randomly along the promoter sequence regions in all stress clusters (Fig. 2). This pattern was particularly pronounced in the case of wounding and salt stress (Fig. 2A & 2B), while the motif distributions in HS were more scattered, especially in Marchantia (Fig. 2C & 2D, Supplemental Fig. S1D). For example, the k-mer pipeline-generated pCREs similar to the motifs of HSF, NAC, and MYB were located near TSSs, consistent with evidence from ChIP-seq data in previous studies (Fig. 2E-2G, Supplemental Fig. S1E) (8, 13, 14). These findings suggested that the information extracted from the k-mer pipeline could provide an alternative method for searching for functional TF motifs on the core promoters.
接下来,我们研究了在应激反应簇中发现的 pCRE 是否主要位于转录起始位点(TSS)周围,这是一个已知的 TF 结合调控区域 (12)。我们研究了 k-mer 和 MEME 管道鉴定出的 pCREs 在 TSSs 上游 1 kb 至下游 0.5 kb 区域的位点分布。来自 k-mer 管道的位点主要分布在 TSSs 附近,而来自 MEME 管道的位点则沿着所有应激簇的启动子序列区域随机分布 (图 2)。这种模式在伤口胁迫和盐胁迫的情况下尤为明显 (图 2A 和 2B),而在 HS 中的基元分布则更为分散,尤其是在 Marchantia 中 (图 2C 和 2D,补充图 S1D)。例如,与 HSF、NAC 和 MYB 的基序相似的 k-mer pipeline 生成的 pCRE 位于 TSSs 附近,这与之前研究中 ChIP-seq 数据的证据一致 (图 2E-2G,补充图 S1E)(8、13、14)。这些发现表明,从 k-mer 管道中提取的信息可以为在核心启动子上寻找功能性 TF 基序提供另一种方法。
k-mer 管道可识别优先位于转录起始位点 (TSS) 附近的主题。
A-B. Distribution of predicted cis-regulatory elements (pCREs) identified by the k-mer pipeline (A) or MEME pipeline (B) during wound stress in tomato, and salt stress in Arabidopsis and Marchantia, respectively. C-D. Similar distributions as in (A) and (B), but for heat stress in Marchantia, Arabidopsis, and tomato. E-G. Heatmaps show the site distributions of k-mer pipeline-derived pCREs that are similar to known TFBMs in response to wounding (E), salt (F), and heat (G) stress, respectively. Site frequency represents the Log2 enrichment of pCRE sites between TPs and TNs and is calculated within a range from 1 kb upstream to 0.5 kb downstream of TSSs using a sliding window of 100 bp and a step size of 25 bp. The mean site frequency values for all pCREs in each cluster are displayed.
A-B。k-mer 管道(A)或 MEME 管道(B)分别识别的番茄伤口胁迫以及拟南芥和马钱子盐胁迫期间的预测顺式调控元件(pCRE)的分布。C-D.类似于 (A) 和 (B) 中的分布,但针对拟南芥、拟南芥和番茄的热胁迫。E-G.热图显示了与已知 TFBMs 相似的 k-mer pipeline 衍生 pCREs 分别对创伤(E)、盐(F)和热(G)胁迫的位点分布。位点频率表示 TPs 和 TNs 之间 pCRE 位点的 Log2 富集度,使用 100 bp 的滑动窗口和 25 bp 的步长在 TSS 上游 1 kb 到下游 0.5 kb 的范围内计算。显示了每个聚类中所有 pCRE 的平均位点频率值。
On the other hand, studies of cis-elements that regulate target genes in a location- and orientation-independent manner, known as enhancers, are just beginning because they often relate to transcription burst frequency (i.e., the probability of transcription activation) rather than the magnitude of mRNA synthesis (i.e., gene expression) (15, 16). Similarly, identifying regulatory sequences that silence gene expression, known as silencers, has been challenging due to the difficulty of classifying the genomic elements (17, 18). Nevertheless, the flexibility of the k-mer pipeline can facilitate the identification of those context-specific features within the genome.
另一方面,以位置和方向无关的方式调控靶基因的顺式元件 (称为增强子)的研究才刚刚开始,因为它们通常与转录爆发频率(即转录激活的概率)而不是 mRNA 合成量(即基因表达)有关(15、16)。同样,由于基因组元件难以分类,识别沉默基因表达的调控序列 ( 即沉默子)一直是个难题 (17、18)。尽管如此,k-mer 管道的灵活性有助于识别基因组中那些特定的上下文特征。
k-mer pipeline identifies the TFs and their known TFBMs in an expression-specific manner
k-mer 管道以特定表达方式识别 TF 及其已知 TFBM
Given that we identified only approximately 10 motifs using the MEME pipeline and with poor prediction performances (Fig. 1), we asked whether stress-responsive gene expressions could be predicted based on known motif sets from plant TFs. The ARABD database with experimental validated TFBMs contains the information of ∼18000 Arabidopsis TFBMs from 529 TFs (19) was applied here. We examined the enriched TFBMs and their corresponding binding sites via Analysis of Motif Enrichment (AME; See Method) approaches for the prediction of gene expression in each gene group. The AUROC scores from the AME pipeline decreased dramatically as the stringency of defining a corresponding binding site increased (i.e., the q-value cutoff decreased from 10-2∼10-4) (Fig. 3A). In addition, the AUROC scores for models using the “counts” of the motif sites on genes were higher compared to those using the “presence” dataset in the q-value cutoff of 10-2 and 10-3. However, those scores were consistent across datasets at the more stringent q-value cutoff (10-4). This indicated that using a less stringent cutoff in the AME pipeline tends to find similar motifs repeatedly in a set of promoters, leading to overfitting in gene expression prediction. In contrast, the k-mer pipeline identified specific short motifs and their perfect-matched sites on the promoters, reducing the likelihood of counting the same regions. Consequently, the predictive power when using either the “counts” or “presence” of the motif sites on genes remained consistent (Fig. 1B). These observations showed that the AME pipeline based on the DAP-seq database was not robust enough to identify models of predicting plant stress response genes.
鉴于我们使用 MEME 管道只鉴定出了大约 10 个主题,而且预测效果不佳 (图 1),我们询问是否可以根据植物 TFs 的已知主题集预测应激反应基因的表达。拟南芥 TFBMs 数据库包含了来自 529 个 TFs(19)的 18000 个拟南芥 TFBMs 信息。我们通过动因富集分析(AME;见方法)方法研究了富集的 TFBMs 及其相应的结合位点,以预测各基因组的基因表达。随着定义相应结合位点的严格程度增加(即 q 值截断值从 10-2∼10-4 降低),AME 管道的 AUROC 分数急剧下降( 图 3A)。此外,与使用 "存在 "数据集的模型相比,在 q 值截止值为 10-2 和 10-3 时,使用基因上的主题位点 "计数 "的模型的 AUROC 得分更高。不过,在更严格的 q 值临界值(10-4)下,不同数据集的得分是一致的。这表明,在 AME 管道中使用较不严格的截止值往往会在一组启动子中重复发现类似的图案,从而导致基因表达预测的过度拟合。相比之下,k-mer 管道能识别启动子上特定的短基元及其完美匹配的位点,从而降低了计算相同区域的可能性。因此,无论是使用基因上的图案位点的 "计数 "还是 "存在",预测能力都保持一致 (图 1B)。 这些观察结果表明,基于 DAP-seq 数据库的 AME 管道在确定预测植物胁迫响应基因的模型方面不够稳健。
k-mer 管道可识别各种压力条件下的特定反式调节器。
A. The heatmaps display prediction scores (AUROC scores) via the Analysis of Motif Enrichment (AME) pipeline (i.e, using the enriched known transcription factor binding motifs (TFBMs) filtered by different q-value cutoffs; See Material and Methods) for each stress group. B. The heatmaps summarize the frequency of transcription factor (TF) families present among clusters of each stress. Shown for the TF families inferred from known TFBMs using the AME pipeline (with q-value cutoff = 10-2) or k-mer pipeline. C. The bubble plot shows the similarity of pCREs, identified from the AME pipeline or k-mer pipeline, to known TFBMs in each stress group. Motif similarities to known TFBMs were calculated using the Pearson Correlation Coefficient (PCC). Quantile represents the enrichment q-values of the motifs identified from the AME or k-mer pipelines, which were standardized by quantile normalization. Motifs were grouped by TF families, and then ranked and scaled by prediction model-derived importance scores. The ratio is calculated as 1 minus the motif’s rank divided by the number of total pCREs identified in each group, ranging from 0 to 1, with higher values indicating higher importance. The highest ratio within each TF family is plotted here. The circles with black strokes indicate those TFBMs identified using the MEME pipeline. The red dashed boxes indicate the TFBMs that are specifically enriched in the salt and wounding dataset.
A.热图显示了通过 "动因富集分析(AME)"管道(即使用经不同 q 值截断值筛选的已知富集转录因子结合动因(TFBM);见材料和方法)对各应激群的预测得分(AUROC 分数)。B. 热图总结了每种胁迫集群中出现的转录因子(TF)家族的频率。图中显示的是使用 AME 管道(q 值截止值 = 10-2)或 k-mer 管道从已知 TFBMs 中推断出的 TF 家族。C. 气泡图显示了从 AME 管道或 k-mer 管道识别出的 pCRE 与每个应力组中已知 TFBM 的相似性。使用皮尔逊相关系数(Pearson Correlation Coefficient, PCC)计算了与已知 TFBM 的相似性。量值表示从 AME 或 k-mer 管道中识别出的基元的富集 q 值,该值通过量值归一化进行标准化。基元按 TF 家族分组,然后根据预测模型得出的重要性评分进行排序和缩放。比值的计算方法是 1 减去主题的等级,再除以每组中鉴定出的 pCRE 总数,从 0 到 1 不等,数值越大表示重要性越高。这里绘制的是每个 TF 家族中最高的比率。带黑色笔划的圆圈表示使用 MEME 管道鉴定出的 TFBM。红色虚线框表示在盐和伤口数据集中特别富集的 TFBM。
Furthermore, we compared the pCREs identified from the k-mer pipeline with TFBMs experimentally validated from the DAP-seq database (19) to determine if the identified pCREs could represent the motifs from various TF families. The pCREs were matched to known TFBMs based on their PWM similarities and were ranked by the importance score generated by the ML algorithm in the k-mers pipeline (Fig. 1). The TFBMs identified by the AME pipeline were present across almost all types of TF families in each cluster, suggesting those motifs may represent the general stress-responsive elements regardless of gene expression patterns (Fig. 3B&3C). In contrast, those identified by the k-mer pipeline exhibited cluster-specific patterns (Fig. 3B&3C). For example, the BZP and bZIP TF families were highly ranked and more enriched in salt and wound stress, while HSF and WRKY TF families were significantly represented in HS conditions. Additionally, the k-mer pipeline identified novel motifs not resembling any known TFBMs, across all stress types (Fig. 3B&3C), indicating that the functions of many cis-regulatory codes remain unknown and await experimental validation. Together, this suggests that the motifs identified from the k-mer pipeline may represent stress- and species-specific TFBMs, while the AME-derived TFBMs represent the motifs that are involved in the general stress responses.
此外,我们还将 k-mer 管道鉴定出的 pCRE 与经 DAP-seq 数据库 (19)实验验证的 TFBM 进行了比较,以确定鉴定出的 pCRE 是否能代表不同 TF 家族的主题。根据 pCRE 与已知 TFBM 的 PWM 相似性进行匹配,并根据 k-mers 管道中 ML 算法生成的重要性得分进行排序 (图 1)。AME 管道识别出的 TFBMs 几乎存在于每个聚类中所有类型的 TF 家族中,这表明无论基因表达模式如何,这些图案都可能代表一般的应激反应元件 (图 3B&3C)。与此相反,k-mer 管道识别出的图案则表现出特定的集群模式 (图 3B&3C)。例如,BZP 和 bZIP TF 家族在盐胁迫和伤口胁迫中排名靠前,富集程度较高,而 HSF 和 WRKY TF 家族在 HS 条件下则有显著的代表性。此外,k-mer 管道还在所有胁迫类型中发现了与任何已知 TFBM 都不相似的新型基序 (图 3B&3C),这表明许多顺式调控代码的功能仍然未知,有待实验验证。总之,这表明从 k-mer 管道识别出的基团可能代表了应激和物种特异性 TFBM,而 AME 衍生的 TFBM 代表了参与一般应激反应的基团。
GRNs with k-mers information are more informative and accurate
包含 k-mers 信息的 GRN 信息量更大、更准确
A transcriptional response is triggered by TFs that directly read and bind to cis-regulatory DNA sequences of genes. We further tested whether including k-mer information would improve the accuracy of HS GRNs. First, we constructed a global HS GRN encompassing 1959 nodes and 6090 edges and then subset this global GRN based on the confidence levels of the edges, resulting in a sub-GRN with 183 nodes and 406 edges (Fig. 4A; Supplemental Fig. S2A). Subsequently, we focused on HSFA2-centered sub-GRN as it plays an important role in HS response in plants (20). We divided the HSFA2 GRNs into three layers based on direct-linked edge information (Fig. 4A & 4B, See Method). We hypothesized that the innermost area (i.e., the 1st layer) contains a higher number of genes directly targeted and bounded by a TF (targeted gene), and hence the corresponding motifs are on the promoter of targeted genes. Supporting this assumption, we found that 92% of nodes contain HSE motifs in the 1st layer of the sub-GRN while only 78% of nodes contain such motifs in the 3rd layer (Fig. 4C). The HSE motifs were specifically enriched in the nodes of all the layers in the network (Fig. 4D). Likewise, the GRNs with k-mer information included a higher ratio of benchmarked genes, such as several heat shock proteins (HSP) including sHSPs, HSP90 and DNAJ (21) (Fig. 4B, Supplemental file S3). We also observed indirect regulation (i.e., no k-mer information in genes) to HSFB2a, HsfB2b, HSFA7a, HSFB1 and HSFA7b (Fig. 4B).
转录反应是由直接读取基因顺式调控 DNA 序列并与之结合的 TF 触发的。我们进一步测试了包含 k-mer 信息是否会提高 HS GRN 的准确性。首先,我们构建了一个包含 1959 个节点和 6090 条边的全局 HS GRN,然后根据边的置信度对全局 GRN 进行子集,得到了一个包含 183 个节点和 406 条边的子 GRN(图 4A;补充图 S2A)。随后,我们重点研究了以 HSFA2 为中心的子基因组网络,因为它在植物的 HS 响应中发挥着重要作用 (20)。我们根据直链边缘信息将 HSFA2 GRN 分成三层 (图 4A 和 4B,见方法)。我们假设,最内层区域(即第 1 层)包含较多的直接靶向基因和以 TF(靶向基因)为边界的基因,因此相应的图案位于靶向基因的启动子上。为支持这一假设,我们发现子基因组网络第 1 层中有 92% 的节点含有 HSE 主题,而第 3 层中只有 78% 的节点含有此类主题 (图 4C)。网络中所有层的节点都特别富集了 HSE 主题 (图 4D)。同样,具有 k-mer 信息的 GRN 包含了较高比例的基准基因,如几种热休克蛋白(HSP),包括 sHSPs、HSP90 和 DNAJ(21)(图 4B,补充文件 S3)。 我们还观察到对 HSFB2a、HsfB2b、HSFA7a、HSFB1 和 HSFA7b 的间接调控(即基因中没有 k-mer 信息)( 图 4B)。
k-mer 管道改进了热应力(HS)GRN 结构。
A. A workflow for GRN identification and inferring directly targeted genes using k-mer information. B. The AtHSFA2 GRN illustrates directly regulated genes (1st layer) and subsequent layers, with known target genes labeled (in black). C. Summary of the number of nodes and edges with or without k-mer information. Shown for the HS GRN will all edges, the top 20% edges and the top 20% edges containing HSF nodes. D. The enrichment of known TFBMs in genes located in different layers of GRNs compared to the global GRN. The significance is determined by Fisher’s exact test with black cells representing p value <0.05. E. Number of AtHSF2 direct-binding genes in the HS-responsive and non-HS-responsive gene clusters from HSFA2 ChIP-seq dataset (22) as well as in the GRNs indicated in (B). The overrepresentation was determined by hypergeometric distribution. F. Enriched GO Biological process terms in the GRNs with or without k-mer information are shown, with q-value (-log2(adjusted p-value)) indicating the significance of the enrichments. G. Ratio of heat-related GO terms in GRNs with or without k-mer information, where the adjusted p-value is determined from Fisher’s exact test.
A. 利用 k-mer 信息识别 GRN 和推断直接靶向基因的工作流程。B. AtHSFA2 GRN 显示了直接调控基因( 第一层 )和后续层,并标注了已知的目标基因(黑色)。C. 带有或不带 k-mer 信息的节点和边的数量汇总。显示的是 HS GRN 的所有边、前 20% 的边和包含 HSF 节点的前 20% 的边。D. 与全局 GRN 相比,位于 GRN 不同层的基因中已知 TFBM 的富集情况。显著性由费雪精确检验确定,黑色单元格代表 p 值小于 0.05。E. HSFA2 ChIP-seq 数据集 (22)中 HS 响应和非 HS 响应基因簇以及(B)中 GRN 中 AtHSF2 直接结合基因的数量。超比例是通过超几何分布确定的。F. 有或没有 k-mer 信息的 GRN 中富集的 GO 生物过程术语,q 值(-log2(调整后 p 值))表示富集的显著性。G. 有或没有 k-mer 信息的 GRN 中与热相关的 GO 术语比率,调整后的 p 值由 Fisher's exact 检验确定。
In addition to known HSP genes, the analyses of hierarchical regulations in the 1st layer sub-GRN found the direct regulations from HSFA2 to NLP4, MYB44, and RAP2.4 (Fig. 4B). To determine whether the nodes with k-mers in the 1st layer of HSFA2 sub-network were indeed the directly bound by HSFA2, we mapped them back to the public available HSFA2 ChIP-seq dataset (22). We found a higher proportion of genes significantly enriched in the clusters (c7, c12, and c14; these were directly taken from the previous publication) directly bound by HSFA2 during HS condition compared to genes in non-HS clusters (c1, c2, and c5). These genes included sHSPs, DNAJ, NLP4, and MYB44 in c7, as well as HSP90 in c12 (Fig. 4E). This observation supports the hypothesis of a direct transcriptional regulation between HSF-related TFs and other unexplored TFs. Lastly, a higher proportion of TF-targeted genes were overrepresented with HS-related GO terms in the sub-GRN that included k-mer information (Fig. 4F & 4G). A similar result was observed when we constructed the SlHSF2-subGRN under heat stress (Supplemental Fig. S2B-S2D).
除了已知的 HSP 基因外, 第一层子 GRN 中的分层调控分析还发现了 HSFA2 对 NLP4、MYB44 和 RAP2.4 的直接调控 (图 4B)。为了确定第一层 HSFA2 子网络中带有 k-mers 的节点是否确实是由 HSFA2 直接绑定的,我们将它们映射回公开的 HSFA2 ChIP-seq 数据集 (22)。我们发现,与非 HS 簇(c1、c2 和 c5)中的基因相比,在 HS 条件下,被 HSFA2 直接结合的基因在簇(c7、c12 和 c14;这些直接取自之前的出版物)中显著富集的比例更高。这些基因包括 c7 中的 sHSPs、DNAJ、NLP4 和 MYB44,以及 c12 中的 HSP90(图 4E)。这一观察结果支持了 HSF 相关 TF 与其他未探索的 TF 之间存在直接转录调控的假设。最后,在包含 k-mer 信息的子基因组中,TF 靶向基因与 HS 相关 GO 术语的比例较高 (图 4F 和 4G)。当我们构建热胁迫下的 SlHSF2-subGRN 时,也观察到了类似的结果(补充图 S2B-S2D)。
Furthermore, when we focused on the sub-network centered by a hub gene, NAC062, which was not previously known to regulate the HS response, we observed that the enriched k-mers on the directly targeted genes of NAC062 were associated with NAC-binding motifs at all layers (Supplemental Fig. S2E& S2F). The GO terms derived from the k-mer network that resembled the NAC binding motif were significantly associated with the general abiotic stress response and response to chitin, compared to the global network (Supplemental Fig. S2G & S2H). These GO terms associated with k-mers that resembled the NAC binding motif differed from those associated with k-mers similar to HSF binding motifs (Supplemental Fig. S2I). The NAC and HSF binding motifs were specifically enriched in their 1st-layer GRNs (Fig. 4D and Supplemental Fig. S2F). These findings suggested that incorporating specific k-mers information into sub-network construction could improve the accuracy of identifying specific TF-targeted gene pairs and inferring their biological functions.
此外,当我们聚焦于以枢纽基因 NAC062 为中心的子网络时,我们观察到 NAC062 的直接靶基因上富集的 k-聚合体与各层的 NAC 结合基序相关(补充图 S2E 和 S2F)。与全局网络相比,从 k-聚合体网络中得出的与 NAC 结合基团相似的 GO 术语与一般非生物胁迫反应和对几丁质的反应有显著关联(补充图 S2G 和 S2H)。这些与 NAC 结合基团相似的 k-mer 相关的 GO 术语与那些与 HSF 结合基团相似的 k-mer 相关的 GO 术语不同(补充图 S2I)。NAC 和 HSF 结合基团特异性地富集在它们的第一层 GRN 中 (图 4D 和补充图 S2F)。这些发现表明,将特定的 k-mers 信息纳入子网络构建可以提高识别特定 TF 靶向基因对并推断其生物学功能的准确性。
Together, this refined GRN construction workflow provided a strategic approach for the in silico prioritization and screening of TFs that are involved in HS response for further experimental validation. Furthermore, it is known that the integration of various data types, such as ATAC-seq or ChIP-seq, can enhance the stability of the GRN architectures (23). Applying the k-mer pipeline to identify motifs in these datasets could further refine the identification of direct targets, thereby untangling the complexity of transcriptional regulatory cascades under different stress conditions.
总之,这种完善的 GRN 构建工作流程为参与 HS 响应的 TFs 的硅学优先排序和筛选提供了一种战略方法,以便进一步进行实验验证。此外,众所周知,ATAC-seq 或 ChIP-seq 等各种数据类型的整合可以提高 GRN 架构的稳定性 (23)。应用 k-mer 管道识别这些数据集中的基序可以进一步完善直接靶标的识别,从而解开不同胁迫条件下转录调控级联的复杂性。
Gapped k-mer pipeline identifies novel HS-responsive motifs
间隙 k-mer 管道识别新型 HS 响应基序
We further noticed that the prediction scores for HS-responsive gene clusters were lower than those for clusters associated with other types of stress, regardless of the plant species (Figs. 1B, 2C, 3B). Given that well-known HS-related motifs (Heat Shock Element, HSE) are gapped, with two wobble nucleotides between GAA and TTC (GAAnnTTC and/ or TTCnnGAA) (24), we modified our k-mer pipeline to enable the search to target gapped k-mers that are similar to HSEs (Fig. 5A, see Method). The F1 and AUROC scores for HS up-regulated clusters or the clusters with more complicated expression patterns (Fig. S1C) improved when we considered gapped k-mer, with up to a 10% increase in each tested species, especially in Marchantia and tomato (Fig. 5B), aligning with our hypothesis and previous observations (8, 14). Less than 10% of gapped k-mers matched the known TFBMs in the DAP-seq database (bottom in Fig. 5C). The gapped k-mers exactly matched HSEs ranked in the top 5 based on their importance scores in each cluster regardless of species (orange dots in Fig. 5C; Supplemental file S4). The gapped k-mers that perfectly matched or had high similarity to the HSE ranked highly in the predictions and were identified more frequently over the 10 machine learning rounds (Fig. 5D & 5E). These observations indicate their substantial contribution to the model’s accuracy. Together with the findings of the un-gapped pCREs (Figs. 1-3), these results indicate the generality and flexibility of the k-mer pipeline to characterize the un-gapped and gapped pCREs for diverse types of stress-related TFs.
我们还注意到,无论植物种类如何,HS 响应基因簇的预测得分均低于与其他类型胁迫相关的基因簇 (图 1B、2C、3B)。鉴于众所周知的 HS 相关基元(Heat Shock Element,HSE)是有间隙的,在 GAA 和 TTC 之间有两个摆动核苷酸(GAAnnTTC 和/或 TTCnnGAA)(24),我们修改了我们的 k-mer 管道,使搜索能够针对与 HSE 相似的有间隙的 k-mer(图 5A,见方法)。当我们考虑间隙 k-mer 时,HS 上调簇或具有更复杂表达模式的簇的 F1 和 AUROC 分数(图 S1C)都有所提高,在每个测试物种中都提高了 10%,尤其是在马钱子和番茄中(图 5B),这与我们的假设和之前的观察结果一致 (8,14)。与 DAP-seq 数据库中已知 TFBM 匹配的间隙 k-mers 不足 10%( 图 5C 底部)。根据每个聚类(无论物种)中 HSE 的重要性得分,完全匹配的间隙 k-mer 排在前 5 位( 图 5C 中的橙色点;补充文件 S4)。与 HSE 完全匹配或具有高度相似性的间隙 k-mers 在预测中排名靠前,并且在 10 轮机器学习中被更频繁地识别出来 (图 5D 和 5E)。 这些观察结果表明,它们对模型的准确性做出了重大贡献。这些结果与非间隙 pCRE 的研究结果 (图 1-3)一起,表明 k-mer 管道在表征不同类型应激相关 TF 的非间隙和间隙 pCRE 方面具有通用性和灵活性。
间隙 k-mer 管道改进了植物的 HS 反应预测。
A. A workflow designed for identifying gapped k-mers under HS conditions. NN represents the gapped sites when searching sites of a tested kmer on genes (See Material and Methods). B. The boxplots illustrate ML performance for the HS dataset in Marchantia, Arabidopsis, and tomato with gapped or non-gapped k-mers identification pipelines. C. A heatmap shows the counts of gapped motifs that are similar to known TFBMs. The similarity between TFBM motifs and gapped motifs and their relative importance scores, represented by quantiles, was indicated at the bottom. D. Dotplots show the frequency in 10 downsizing datasets and the importance score for each enriched gapped k-mer, with red dots indicating known HSE motifs. The dashed red lines represent the median value of important scores and frequency in each group. E. Representative sequence logo from un-gapped and gapped k-mers that matched to HSE from DAP-seq database.
A.为在 HS 条件下识别间隙 k-mers 而设计的工作流程。NN 表示在基因上搜索测试 kmer 位点时出现的间隙位点(见材料与方法)。B. 方框图说明了在马钱子、拟南芥和番茄的 HS 数据集上使用间隙或非间隙 k-mers 鉴定管道的 ML 性能。C. 热图显示了与已知 TFBM 相似的间隙基元的数量。底部显示了 TFBM 主题和间隙主题之间的相似性及其相对重要性得分(以量级表示)。D. 点阵图显示了 10 个缩小数据集中的频率和每个富集的间隙 k-mer 的重要性得分,红点表示已知的 HSE 主题。红色虚线代表每组重要得分和频率的中值。E. DAP-seq 数据库中与 HSE 匹配的未间隙和间隙 k-mer 的代表性序列标识。
PREDICT: an easy-to-use web tool for k-mer identification and TFBM association
PREDICT:用于 k-mer 识别和 TFBM 关联的易用网络工具
To streamline the identification and visualization of k-mers intuitively using our pipeline, we built a new web-based tool, Predictive Regulatory Element Database for Identifying Cis-regulatory elements and Transcription factors web tool (PREDICT), for conducting the motif search using k-mer pipeline, visualizing k-mer sites along genes and associating k-mers to the known TFBMs. The workflow of the PREDICT web tool can be broken down into three major steps, as shown in Fig. 6A:
为了简化 k- mer 的识别和可视化,我们建立了一个新的基于网络的工具--用于识别顺式调控 元件和转录因子的预测性 调控元件数据库网络工具(PREDICT),用于使用 k-mer 管道进行 motif 搜索,可视化沿基因的 k- mer 位点,并将 k-mer 与已知的 TFBM 关联起来。 如图 6A 所示,PREDICT 网络工具的工作流程可分为三个主要步骤:
用于识别顺式调控元件和转录因子的预测性调控元件数据库(PREDICT)可识别水稻数据集中热胁迫条件下的富集 k-mers。
A. Schematic representation of the major steps in the PREDICT web tool. B. Representative screenshots for PREDICT. Note that in the web tool version, the program will search starting from 4- or 5-mers until 12-mers. The PREDICT tab is for data upload and executing the program and the Kmer-view tab is for exploring the sites of one k-mer in multiple genes or multiple k-mers in one gene. C. The heat-responsive expression patterns, the gene numbers, the counts of k-mers, and the ML prediction performance (AUROC scores) for gene clusters identified from rice. Note that only selected groups were subjected to ML prediction due to the amount of enriched k-mers.
A.PREDICT 网络工具主要步骤示意图。B. PREDICT 的代表性截图。请注意,在网络工具版本中,程序将从 4 或 5 个分子开始搜索,直到 12 个分子。PREDICT 标签用于上传数据和执行程序,Kmer-view 标签用于探索多个基因中一个 k-mer 或一个基因中多个 k-mer 的位点。C. 从水稻中识别出的基因群的热响应表达模式、基因编号、k-mer 计数和 ML 预测性能(AUROC 分数)。请注意,由于富集的 k-mers 数量较多,因此只对选定的组进行了 ML 预测。
Data preparation 数据准备
Users prepare the predefined lists of True Positive (i.e., genes with interested expression patterns) and True Negative (i.e., genes without regulatory patterns) based on the expression patterns observed in the dataset. These lists are saved into a file with a .txt extension before uploading. The users have the option to upload the promoter sequence files in a file with the tab-delimited format and with a .txt extension for the genomic regions of genes from the species of their interest, or they may choose to work with the built-in promoter files of all genes (Arabidopsis thaliana and Oryza sativa cv. japonica). We aim to expand the number of built-in genomes in the future.
用户根据在数据集中观察到的表达模式,准备预定义的 "真阳性"(即具有相关表达模式的基因)和 "真阴性"(即没有调控模式的基因)列表。这些列表在上传前会保存到一个扩展名为 .txt 的文件中。用户可以选择以制表符分隔格式上传启动子序列文件,文件扩展名为.txt,用于他们感兴趣的物种的基因组区域,也可以选择使用所有基因 (拟南芥和粳稻 )的内置启动子文件。我们的目标是在未来扩大内置基因组的数量。
PREDICT identification PREDICT 识别
Once users upload the prepared file to the web tool and click the “Run” button, the program initiates the identification of enriched k-mers. The processing time for PREDICT may vary based on the number of genes in the list, but it typically does not exceed around 2 minutes for a list of around ∼1000 genes. The results are displayed almost instantaneously after the calculations are complete.
用户将准备好的文件上传到网络工具并点击 "运行 "按钮后,程序就会开始识别富集的 k-mers。PREDICT 的处理时间会根据列表中基因的数量而有所不同,但对于一个约 1000 个基因的列表来说,处理时间通常不会超过 2 分钟。计算完成后,结果几乎会立即显示出来。
Data exploration 数据探索
A notification will inform users when the identification job has finished. Users can then download the lists of k-mers and their corresponding p-values by clicking the “Download” button with the results provided in txt file format. Users have the option to display the sites of individual k-mer along all genes or to view the sites of all enriched k-mers in a single gene. In addition, users can visit the MEME website to compare the k-mers with known TFBM and then potentially associate them with specific TFs.
识别工作完成后,系统会发出通知。然后,用户可以点击 "下载 "按钮下载 k-mer 列表及其相应的 p 值,结果以 txt 文件格式提供。用户可以选择显示所有基因中单个 k-mer 的位点,或查看单个基因中所有富集 k-mer 的位点。此外,用户还可以访问 MEME 网站,将 k-mers 与已知的 TFBM 进行比较,然后将它们与特定的 TF 关联起来。
PREDICT has built-in example files that users can load to test the programs and parameters. These examples are derived from our previous publications and analyses (Fig. 1). In addition, to demonstrate the efficacy of our web tool, we utilized an independent rice dataset at HS and drought conditions (SRP190858) for k-mer identification and the prediction of gene expression (Fig. 6B & C). Several k-mers were identified in each HS group, with counts ranging from 34-227 and with the AUROC scores varying from 0.69 to 0.82. Similarly, the k-mer pipeline successfully predicted the gene expressions for drought response in rice with the AUROC scores ranging from 0.77 to 0.88 (Supplemental Fig. S3). PREDICT provided an easy-to-use web service for identifying and comparing enriched k-mers in various stresses, which can now facilitate studies into the diversification or conservation of the cis-regulatory codes in stress responses in the same or different species, thereby guiding the future experiment directions. Beyond the transcriptional regulation, PREDICT also allowed us to identify the k-mer-derived mRNA features around the translational initiation sites (TISs), hence deepening our understanding of the mechanism of translational regulation (25). These results suggested not only that the k-mer pipeline can be widely applied to various research questions at different regulatory levels in stress biology, but also demonstrated the extensive utility of a PREDICT-style analysis.
PREDICT 有内置的示例文件,用户可以加载这些文件来测试程序和参数。这些示例来自我们以前的出版物和分析 (图 1)。此外,为了证明我们的网络工具的有效性,我们使用了一个独立的水稻数据集(SRP190858),在 HS 和干旱条件下进行 k-单体鉴定和基因表达预测 (图 6B 和 C)。在每个 HS 组中都鉴定出了几个 k-聚合体 ,其数量从 34-227 个不等,AUROC 分数从 0.69 到 0.82 不等。同样,k-聚合体管道也成功预测了水稻干旱响应的基因表达,其 AUROC 得分为 0.77 至 0.88(补充图 S3)。PREDICT 提供了一种易于使用的网络服务,用于识别和比较各种胁迫中富集的 k-mer,这有助于研究同一物种或不同物种胁迫反应中顺式调控密码的多样性或保护性,从而指导未来的实验方向。除了转录调控,PREDICT 还让我们识别了翻译起始位点(TISs)周围 k-mer 衍生的 mRNA 特征,从而加深了我们对翻译调控机制的理解 (25)。这些结果不仅表明 k-mer 管道可广泛应用于应激生物学中不同调控水平的各种研究问题,还证明了 PREDICT 式分析的广泛实用性。
Conclusion 结论
Over the past few years, the significance of cis-regulatory elements in studying the regulatory landscapes of various plant species in ever-changing environments, as well as in the field of plant synthetic biology, has been increasingly recognized and emphasized. The web tools based on known TFBMs are used to find motif sites on stress-response genes. These tools and databases have been developed to identify cis-regulatory elements by identifying consensus sequences based on sequence similarity of the known TFMBs (26). However, a user-friendly pipeline integrating the discovery of key cis-regulatory elements on pattern-specific gene clusters to the identification of precise TFBM sites along individual genes has not been established for the plant research community. In this study, we developed a PREDICT web tool with the rapid k-mer identification pipeline for profiling precise cis-regulatory elements on genes among different stresses and across plant species. We first demonstrate that using precise k-mers as progenitors can significantly enhance the prediction of gene expression under various stress conditions across different plant species. Secondly, we provide evidence that k-mers are valuable features for deducing stress-specific TFs. Thirdly, we illustrate that employing k-mers as features to refine the architecture of GRNs can lead to the generation of new hypotheses for subsequent narrowing down the candidates for experimental validation. Lastly, we have developed a web tool called PREDICT, which allows for k-mer searches through a graphical user interface (GUI). We believe that this all-in-one web tool will save researchers time and effort, eliminating the need to navigate various command lines for such analyses, and will enhance the understanding of the mechanisms underlying cis-regulatory elements, with potential applications in synthetic biology.
在过去几年中, 顺式调控元件在研究各种植物物种在不断变化的环境中的调控景观以及在植物合成生物学领域的意义日益得到认可和重视。基于已知 TFBM 的网络工具被用来寻找应激反应基因上的主题位点。开发这些工具和数据库的目的是根据已知 TFMB 的序列相似性确定共识序列,从而识别顺式调控元件 (26)。然而,植物研究界尚未建立一个用户友好型管道,将发现模式特异性基因簇上的关键顺式调控元件与识别单个基因上的精确 TFBM 位点结合起来。在本研究中,我们开发了一个 PREDICT 网络工具,该工具具有快速 k-mer 识别管道,可用于分析不同胁迫和不同植物物种中基因上的精确顺式调控元件。我们首先证明,使用精确的 k-聚合体作为原基可以显著提高对不同植物物种在各种胁迫条件下基因表达的预测能力。其次,我们提供的证据表明,k-mers 是推导胁迫特异性 TFs 的重要特征。第三,我们说明了使用 k 元组作为特征来完善 GRNs 的结构可以产生新的假设,从而缩小实验验证的候选范围。 最后,我们还开发了一款名为 PREDICT 的网络工具,可通过图形用户界面 (GUI) 进行 k-mer 搜索。我们相信,这种一体化的网络工具将为研究人员节省时间和精力,使他们不再需要浏览各种命令行来进行此类分析,并将加深对顺式调控元件内在机制的理解,从而在合成生物学中得到潜在应用。
Material and Methods
k-mer identification pipeline
To identify putative cis-regulatory elements (pCREs) associated with a given stress response, a k-mer (oligomer with the length of k) pipeline was adopted from a previous study (1). Briefly, this pipeline identifies short sequences significantly enriched in the regulatory region of stress-response gene clusters compared to the nonresponsive gene cluster and determines adjusted p-values using Fisher’s exact test and multiple testing correction (Benjamini-Hochberg method) (27). The regulatory region is defined as the region ranging from upstream 1 kb to downstream 0.5 kb of the transcription start site (TSS) of a gene. This k-mer pipeline includes several steps to discover pCREs, as shown in Figure 1A. Below, we describe each step in detail.
Step 1: Evaluate all possible 5-mer oligomers for enrichment among genes in stress-response clusters (positive groups) versus nonresponsive clusters (negative groups). Retain only 5-mers with significant enrichment (adjusted p-values<0.05).
Step 2: Extend the sequence of the significantly enriched 5-mers from Step 1 by 1 nt in either direction, assess the enrichment, and retain those that remain significantly enriched (adjusted p-values<0.05). Repeat this process until no further enriched extended k-mers are identified. If two k-mers are both significantly enriched and one is a match within the other, only the one with a lower adjusted p-value was retained.
Step 3: Repeat the process from Step 1, but begin with all possible 6-mers. Combine significantly enriched 6-mers with the set of the k-mers identified from Step 2.
Step 4: Similar to step 2, but begin with a set of k-mers from Step 3. Finally, identify the set of k-mers significantly enriched in the stress response cluster and consider them as pCREs.
Gapped k-mers identification pipeline
The gapped 8-mers sequences are designed by combining two 3-kmer oligomers separated by a 2-bp gap. A set of the gapped 8-mers is generated by randomly selecting a pair of 3-mers oligomers from all possible 3mers and inserting 2 ‘N’s in-between. The enrichment analyses were performed as described in the previous section.
MEME pipeline for pCRE identification
Another toolkit within the motif-finding pipeline, which includes the MEME and Find Individual Motif Occurrences (FIMO) programs, was adopted to discover pCREs and identify their potential sites on genes (28). To comprehensively identify the MEME-derived pCREs, two search modes were used: Discriminative (with parameters of -mod zoops -nmotifs 50 -nostatus -minw 6 -maxw 15 -objfun classic -revcomp -markov_order 0) and Differential Enrichment (with parameters of -mod zoops -nmotifs 50 -nostatus - minw 6 -maxw 15 -objfun de -revcomp). MEME-derived pCREs with an e-value < 5e-02 identified from either search mode were included for motif site identification via FIMO.
The FIMO-based search for motif sites was conducted with parameters --bgfile --nrdb --thresh 1e-2 --parse-genomic-coord. Motif sites meeting different q-value thresholds as indicated in Figures 1 and 3 were selected for further machine learning (ML) analyses of gene expression.
Analysis of Motif Enrichment (AME) pipeline for enriched known TFMBs searching
First, we used the FIMO program to search for the binding sites of 591 Arabidopsis TFs identified previously (19) on the genes in each group. We then determined the site enrichments for each TFBM based on using the criteria of “counts” or “presence” and “FIMO-derived q-value” ranging from 10-2 to 10-4 by one-tail Fisher’s exact test (FET). Finally, we selected the TFBMs that showed enrichment with a q-value ranging from 10-2 to 10-4 in each group, respectively, for further prediction.
RNA-seq dataset process, true negative and positive dataset selection
Raw read counts were taken from previous studies (See Supplemental file S1). Differentially expressed genes (DEGs) were identified with the R package DESeq2 followed by comparisons to the expression level at 0 h with the threshold of |Log2-fold change| ≥ 1 and False Discovery Rate (FDR) ≤ 0.05 using the “lfcShrink ashr” method. Genes with |Log2-fold change| ≤ 0.8 and FDR > 0.05 across each group were defined as true negative (TN) datasets. The DEGs were then clusters based on their expression pattern in the case of the heat stress dataset. The other groups were directly taken from published studies (1, 3).
Machine learning (ML) pipeline
An ML pipeline described previously was retrieved from GitHub (https://github.com/ShiuLab/ML-Pipeline; https://github.com/azodichr/ML-Pipeline/tree/master/Workshop) (29). Briefly, we used scikit-learn (v0.24.2) in Python (v3.7.0) to train and test the models. For each stress-responsive group, we split balanced data into training (70%) and testing (30%) sets and tested four classification methods: Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), and Gradient Boost (GB). We used ten-fold internal cross-validation to select the optimized hyperparameters. AUC-ROC score was used to select the best model for each stress-responsive group. Random downsizing addressed the imbalance between positive and negative datasets. Enriched k-mers maps from 10 downsizing iterations served as input for training datasets.
Comparison of MEME and k-mer motif distribution
The comparison of motif distribution involved two steps: site retrieval using FIMO and site counting using a sliding window. Initially, sites with enriched motifs, predicted by either k-mer or MEME pipelines, were identified in the promoters of both positive and negative genes. In the MEME pipeline, enriched motifs were systematically scanned with FIMO using match p-value thresholds (10-4 or 10-6), and scanning templates for the specified strand. For the k-mer pipeline, all enriched motifs were subjected to FIMO scanning. During the subsequent site retrieval step, a sliding window approach was used to scan for motif match sites within a given DNA strain. The window size was set to 100 bp, with a 25 bp sliding interval. The average count of matched sites for each enriched motif provided values within each window. To normalize these data, z-scores were computed for the ratio of window values from positive genes to those from negative genes within the same range.
Construction of gene regulatory network (GRN)
Normalized gene expression counts were used to construct GRNs. Time-series heat stress datasets for Arabidopsis thaliana and Solanum lycopersicum were obtained from publicly available datasets on NCBI (7–10). DEGs were identified as aforementioned. Batch effect correction was applied using ComBat-seq (30). DEGs were then categorized into different groups based on their expression at each time point in Arabidopsis and tomato datasets. k-mers associated with each group were identified using the k-mer pipeline, followed by calculating their Pearson correlation coefficients (PCC) with motifs in the Arabidopsis DAP-seq database (31). The motif family with the highest PCC was selected to represent each k-mer.
ARACNe-AP (32) and GENIE3 (33) were used to generate GRNs. TFs were assigned before network construction. Only edges with Bonferroni-corrected p-values below 0.05, based on mutual information (MI) values in the ARACNe-AP network, were retained. For GENIE3, a baseline edge weight of 0.01 was set, and the interquartile range (IQR) of all edge weights was determined. The third quartile (Q3) edge weight was used to filter for high-confidence subnetworks that represent regulatory interactions within GENIE3. An intersection GRN was generated by combining the outputs of both algorithms. The top 20% of edges ranked by weight and MI value were selected to form a representative subnetwork for each species. For validation, key TFs were identified as pseudo-centers to examine the top 20% subnetwork. Nodes were grouped before each GRN construction, incorporating enriched k-mers and motif families. Edges were labeled as “w/ k-mer” if the hub TFs-associated k-mers were present in the promoters of target nodes towards hub TFs, referred to as a “direct-linkage” relationship. Otherwise, they were labeled as “w/o k-mer”. The first-layer network was defined by the immediate neighboring nodes and edges connected to the selected TF whereas the second- and third-layer networks were determined by the immediate neighboring nodes and edges originating from the nodes in the first and second layers, respectively. Gene ontology analysis was performed and compared across GRNs w/ k-mer or w/o k-mer, as well as the original GRN. The GRNs were visualized by Cytoscape software v3.10.0 (34).
Construction of PREDICT web tool
The original scripts were modified to meet the requirements for web-based automated processing. This process involved the translation of scripts from Python 2 to Python 3, as well as the adjustment of parameters for the execution loops in the pipeline. For the web application architecture, we employed the robust LAMP stack, which integrates Linux, Apache, MySQL, and PHP to ensure scalable and efficient performance. To enhance user accessibility, the final outputs are available for a text-based download option. Additionally, the results are integrated for interactive viewing on dedicated web pages.
Data availability
PREDICT is provided as a web tool, which is freely available at [http://predict.southerngenomics.org/kmers/kmers.php]. The accession numbers of all data used in this work are described in Supplemental Fig. 1A. The data and code used to reproduce the analyses are available at https://github.com/LavakauT/ML-pipeline and https://github.com/LavakauT/KmersDiscovery_MEME.
Author contributions
TYW and MJL conceived and designed the project, and TYW, MJL, LT, and WHE analyzed the data and contributed to the design of the web tool. LT, TYW, and MJL wrote the paper. TYW and MJL edited the paper.
Supplemental files
Supplemental file S1 Motifs and their enrichment derived from k-mer, AME, and MEME pipelines for each group
Supplemental file S2 ML performance across different stress- and species-datasets
Supplemental file S3 List of genes used for GRN construction
Supplemental file S4 List of gapped motifs identified in each dataset
Supplemental file S5 Motif sitemaps for pCREs from k-mer and MEME pipelines
Supplemental file S6 List of genes and enriched k-mers in rice heat stress dataset
Supplemental file S7 List of genes enriched k-mers in rice drought stress dataset
Acknowledgments
We thank IPMB and the AS-BCST Bioinformatics Core for high-performance computing services, Ms. Chun-Yi Chen for web tool development, Dr. Yao-Cheng Lin and Mr. Te-Chang Hsu (AS-BSCT group) for technical support, and Dr. Chia-Yi Cheng (National Taiwan University) for critical discussion for this project. The research was supported by grant NSCT 112-2311-B-001-009-MY3 to Ting-Ying Wu and NSTC 112-2311-B-001-005 and AS-CDA-111-L06 to Ming-Jung Liu.
References
- 1.↵
- 2.↵
- 3.↵
- 4.↵
- 5.↵
- 6.↵
- 7.↵
- 8.↵
- 9.
- 10.↵
- 11.↵
11.↵ - 12.↵
12.↵ - 13.↵
13.↵ - 14.↵
14.↵ - 15.↵
15.↵ - 16.↵
16.↵ - 17.↵
17.↵ - 18.↵
18.↵ - 19.↵
19.↵ - 20.↵
20.↵ - 21.↵
21.↵H.-c.Liu,Y.-y.Charng,拟南芥 A1 和 A2 类热休克因子在多种非生物胁迫响应和发育中的共同和独特功能植物生理学 163,276-290(2013).AbstractFREE Full TextGoogle Scholar - 22.↵
22.↵ - 23.↵
23.↵ - 24.↵
24.↵N.Santoro,N.Johansson,D. J.Thiele,热休克转录因子对温度和转录激活域的要求中,热休克元件结构是一个重要的决定因素。Mol Cell Biol18, 6340-6352 (1998).AbstractFREE Full TextGoogle Scholar - 25.↵
25.↵T.25.Y.Wuet al., Modeling alternative translation initiation sites in plants reveals evolutionarily conserved cis-regulatory codes in eukaryotes.Genome Resdoi:10.1101/gr.278100.123 (2024).AbstractFREE Full TextGoogle Scholar - 26.↵
26.↵ - 27.↵
27.↵ - 28.↵
28.↵ - 29.↵
29.↵ - 30.↵
30.↵ - 31.↵
31.↵ - 32.↵
32.↵ - 33.↵
33.↵ - 34.↵
34.↵P.Shannonet 等人,Cytoscape:生物分子相互作用网络集成模型的软件环境。Genome Res13, 2498-2504 (2003).AbstractFREE Full TextGoogle Scholar