这是用户在 2025-3-8 10:27 为 https://app.immersivetranslate.com/pdf-pro/f89aa524-c2bd-45cc-a16c-ebae42c05ec6/ 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?

A probabilistic approach to explore human miRNA targetome by integrating miRNA-overexpression data and sequence information
通过整合 miRNA 表达数据和序列信息探索人类 miRNA 靶标组的概率方法

Yue Li 1 , 2 , 1 , 2 , ^(1,2,***){ }^{1,2, \star}, Anna Goldenberg 1 , 3 1 , 3 ^(1,3){ }^{1,3}, Ka-Chun Wong 1 , 2 1 , 2 ^(1,2){ }^{1,2} and Zhaolei Zhang 1 , 2 , 4 , 1 , 2 , 4 , ^(1,2,4,**){ }^{1,2,4, *} 1 1 ^(1){ }^{1} Department of Computer Science, 2 2 ^(2){ }^{2} The Donnelly Centre, University of Toronto, Toronto, Ontario M5S 3E1, 3 3 ^(3){ }^{3} Genetics and Genome Biology, SickKids Research Institute, Toronto, Ontario M5G 1 L 7 1 L 7 1L71 L 7 and 4 4 ^(4){ }^{4} Banting and Best Department of Medical Research and Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 3E1, Canada
1 1 ^(1){ }^{1} 计算机科学系, 2 2 ^(2){ }^{2} 多伦多大学唐纳利中心,多伦多,安大略省M5S 3E1, 3 3 ^(3){ }^{3} 遗传学和基因组生物学,SickKids研究所,多伦多,安大略省M5G 1 L 7 1 L 7 1L71 L 7 4 4 ^(4){ }^{4} 班廷和贝斯特医学研究系和分子遗传学系,多伦多大学,多伦多,安大略省M5S 3E1,加拿大

Associate Editor: Ivo Hofacker
副主编:Ivo Hofacker

Abstract  摘要

Motivation: Systematic identification of microRNA (miRNA) targets remains a challenge. The miRNA overexpression coupled with genome-wide expression profiling is a promising new approach and calls for a new method that integrates expression and sequence information. Results: We developed a probabilistic scoring method called targetScore. TargetScore infers miRNA targets as the transformed fold-changes weighted by the Bayesian posteriors given observed target features. To this end, we compiled 84 datasets from Gene Expression Omnibus corresponding to 77 human tissue or cells and 113 distinct transfected miRNAs. Comparing with other methods, targetScore achieves significantly higher accuracy in identifying known targets in most tests. Moreover, the confidence targets from targetScore exhibit comparable protein downregulation and are more significantly enriched for Gene Ontology terms. Using targetScore, we explored oncomir-oncogenes network and predicted several potential cancer-related miRNA-messenger RNA interactions. Availability and implementation: TargetScore is available at Bioconductor: http://www.bioconductor.org/packages/devel/bioc/ html/TargetScore.html. Contact: yueli@cs.toronto.edu or zhaolei.zhang@utoronto.ca Supplementary information: Supplementary data are available at Bioinformatics online.
动机:系统鉴定 microRNA(miRNA)靶标仍是一项挑战。miRNA 过表达与全基因组表达谱分析相结合是一种很有前景的新方法,需要一种整合表达和序列信息的新方法。结果:我们开发了一种名为 targetScore 的概率评分方法。TargetScore 根据观察到的目标特征,以贝叶斯后验加权的折叠变化来推断 miRNA 目标。为此,我们从基因表达总库(Gene Expression Omnibus)中收集了 84 个数据集,对应 77 个人体组织或细胞和 113 个不同的转染 miRNA。与其他方法相比,targetScore 在大多数测试中识别已知靶标的准确率明显更高。此外,targetScore 的置信靶标表现出相似的蛋白下调,而且基因本体术语的富集程度更高。利用 targetScore,我们探索了 oncomir-癌基因网络,并预测了几种潜在的癌症相关 miRNA-信使 RNA 相互作用。可用性和实施:TargetScore可从Bioconductor获取:http://www.bioconductor.org/packages/devel/bioc/ html/TargetScore.html。联系方式:yueli@cs.toronto.edu 或 zhaolei.zhang@utoronto.ca 补充信息:补充数据可在 Bioinformatics online 上获取。

Received on July 18, 2013; revised on October 6, 2013; accepted on October 15, 2013
2013年7月18日收到;2013年10月6日修订;2013年10月15日接受

1 INTRODUCTION  1 引言

MicroRNAs (miRNAs) repress protein production in animal species by forming Watson-Crick base pairing to the 3’-untranslated regions of the target messenger RNAs (mRNAs) (Friedman et al., 2009). The binding primarily occurs at the 2 7 nt 2 7 nt 2-7nt2-7 \mathrm{nt} positions from the 5 5 5^(')5^{\prime} end of the miRNA, which is termed as the ‘seed’ and the binding as the ‘seed match’ (Lewis et al., 2003). MiRNA regulations have been implicated in numerous developmental and pathogenic processes (Bartel, 2009). Functional characterization of miRNAs depends on precise identification of their targets. However, it has proved difficult to experimentally identify miRNA-mRNA interactions.
微小RNA(miRNA)通过与目标信使RNA(mRNA)的3'-非翻译区形成Watson-Crick碱基配对,抑制动物体内蛋白质的产生(Friedman等人,2009年)。这种结合主要发生在 miRNA 的 5 5 5^(')5^{\prime} 2 7 nt 2 7 nt 2-7nt2-7 \mathrm{nt} 位置,称为 "种子",而这种结合称为 "种子配对"(Lewis 等,2003 年)。miRNA 的调控与许多发育和致病过程有关(Bartel,2009 年)。miRNA 的功能表征取决于对其靶标的精确鉴定。然而,实验证明很难确定 miRNA 与 mRNA 之间的相互作用。
To date, according to mirTarBase, only 3565 interactions between 432 miRNAs and 1959 target genes in human have
迄今为止,根据 mirTarBase,人类 432 个 miRNA 与 1959 个靶基因之间只有 3565 次相互作用
been confirmed using high-confidence low-throughput assays such as Western blot (Hsu et al., 2011). On the other hand, computational prediction provides a rapid alternative method to identify putative miRNA targets that can be subsequently validated. However, accurate prediction of miRNA targets remains a challenge with the current state-of-the-art algorithms achieving < 50 % < 50 % < 50%<50 \% specificity and having poor agreement among them (Alexiou et al., 2009). Most of these prediction programs are based on sequence complementarity, evolutionary conservation (Friedman et al., 2009; Lewis et al., 2003), free energy (Enright et al., 2004; Krek et al., 2005; Lewis et al., 2003) and/ or target site accessibility (Kertesz et al., 2007). Although it is shown that evolutionary conservation can improve signal-to-noise ratios, the conservation approach is limited, as not all of the functional target sites are conserved (e.g. lineage- and species-specific target sites) and vice versa. In particular, the performance of conservation-based methods drops drastically when restricted to only mammalian genomes because of short evolutionary time (Friedman et al., 2009). Similarly, the energy- or accessibility-based methods rely on secondary structure prediction tools such as RNAfold (Lorenz et al., 2011), which itself has room for improvements. These limitations also underscore the historical lack of genome-wide functional data that measure the in vivo impact of miRNA regulation, which was alleviated by the recent developments of transcriptomic and proteomic profiling methods (Baek et al., 2008; Lim et al., 2005; Selbach et al., 2008).
通过高置信度的低通量检测方法,如 Western 印迹法(Hsu et al.)另一方面,计算预测为确定推定的 miRNA 靶点提供了一种快速的替代方法,这些靶点可随后进行验证。然而,精确预测 miRNA 靶标仍然是一个挑战,目前最先进的算法都能达到 < 50 % < 50 % < 50%<50 \% 特异性,但它们之间的一致性很差(Alexiou 等,2009 年)。这些预测程序大多基于序列互补性、进化保护(Friedman 等人,2009 年;Lewis 等人,2003 年)、自由能(Enright 等人,2004 年;Krek 等人,2005 年;Lewis 等人,2003 年)和/或靶位点可及性(Kertesz 等人,2007 年)。尽管研究表明进化保护能提高信噪比,但保护方法也有局限性,因为并非所有功能性靶位点(如品系和物种特异性靶位点)都得到保护,反之亦然。特别是,由于进化时间较短,当局限于哺乳动物基因组时,基于保守性的方法性能急剧下降(Friedman 等,2009 年)。同样,基于能量或可及性的方法依赖于二级结构预测工具,如 RNAfold(Lorenz 等人,2011 年),而该工具本身还有改进的余地。这些局限性也凸显了过去缺乏测量 miRNA 调控对体内影响的全基因组功能数据,而最近转录组和蛋白质组分析方法的发展缓解了这一问题(Baek 等人,2008 年;Lim 等人,2005 年;Selbach 等人,2008 年)。
Particularly, overexpression of miRNA coupled with expression profiling of mRNA by either microarray or RNA-seq has proved to be a promising approach (Arvey et al., 2010; Lim et al., 2005). Consequently, genome-wide comparison of differential gene expression holds a new promise to elucidate the global impact of a specific miRNA regulation without solely relying on evolutionary conservation. To improve the prediction of relative repression of mRNA, several studies have proposed a series of additional determinants including seed match type (6mer seed, 7 mer-tA1, 7 mer-m8 and 8mer), number of target sites, site relative location on the 3 3 3^(')3^{\prime}-untranslated region, local AU content, 3 '-supplementary paring, seed-pairing stability and target-site abundance (Arvey et al., 2010; Garcia et al., 2011; Grimson et al., 2007), which are combined together in a single value termed as the ‘context score’ made available from targetScan Web site (Garcia et al., 2011). Accordingly, we hypothesize that target prediction can be improved by integrating expression change and sequence information such as context score and
特别是,miRNA 的过表达与微阵列或 RNA-seq 的 mRNA 表达谱分析相结合,已被证明是一种很有前景的方法(Arvey 等人,2010 年;Lim 等人,2005 年)。因此,全基因组的差异基因表达比较为阐明特定 miRNA 调控的全球影响带来了新的希望,而无需完全依赖进化保护。为了改进对 mRNA 相对抑制的预测,一些研究提出了一系列额外的决定因素,包括种子匹配类型(6mer seed、7 mer-tA1、7 mer-m8 和 8mer)、靶位点数量、位点在 3 3 3^(')3^{\prime} 非翻译区的相对位置、局部 AU 含量、3'-补充配对、种子配对稳定性和靶位点丰度(Arvey et al、2010;Garcia 等人,2011;Grimson 等人,2007),它们被合并为一个单一值,称为 "上下文得分",可从 targetScan 网站获取(Garcia 等人,2011)。因此,我们假设,通过整合表达变化和序列信息(如上下文得分和 "目标预测"),可以改进目标预测。

other orthogonal sequence-based features such as conservation (Friedman et al., 2009) into a probabilistic score.
其他基于序列的正交特征(如保持性)(Friedman 等人,2009 年)转化为概率得分。
The proposed model differs from most of the previous expres-sion-based target prediction methods in three important aspects. First, our model is specifically designed for miRNAoverexpression experiments to interrogate targets of a particular miRNA in a specific cell condition. To our knowledge, only a few methods are suitable for such task (Liu et al., 2010a). Most existing methods such as GenMiR++ (Huang et al., 2007) and GroupMiR (Le and Bar-Joseph, 2011) are based on global miRNA-target expression correlation, which requires a large set of expression profiles of both mRNA and miRNA measured across various distinct conditions and may miss targets that are specific to only a subset of the input samples. Second, the proposed model is unsupervised such that it infers miRNA-targets solely based on their distinct high-dimensional patterns of expression fold-changes and sequence features. However, the existing methods are mostly regression-based frameworks by treating gene expression as response and miRNA expression as input variables (Huang et al., 2007; Le and Bar-Joseph, 2011) or supervised learning by explicitly operating on a training dataset of confidence positive and negative targets (Liu et al., 2010b; Sumazin et al., 2011), which are incomplete and difficult to obtain. Third, our method operates on the entire gene set to more closely model the overall likelihood rather than only on a subset of genes prefiltered by targetScan score (Huang et al., 2007) or sample variance (Le and Bar-Joseph, 2011).
所提出的模型在三个重要方面有别于以往大多数基于表达的靶标预测方法。首先,我们的模型是专门为 miRNA 表达实验设计的,目的是检测特定细胞条件下特定 miRNA 的靶标。据我们所知,只有少数方法适用于此类任务(Liu 等,2010a)。现有的大多数方法,如 GenMiR++(Huang 等人,2007 年)和 GroupMiR(Le 和 Bar-Joseph,2011 年),都是基于全局 miRNA-靶点表达相关性的,这需要在各种不同条件下测量大量 mRNA 和 miRNA 的表达谱,可能会漏掉只对输入样本的一部分具有特异性的靶点。其次,所提出的模型是无监督的,它仅根据表达折叠变化和序列特征的不同高维模式来推断 miRNA 目标。然而,现有的方法大多是基于回归的框架,将基因表达作为响应变量,miRNA 表达作为输入变量(Huang 等人,2007 年;Le 和 Bar-Joseph,2011 年),或者是监督学习,明确地在置信阳性和阴性目标的训练数据集上操作(Liu 等人,2010 年 b;Sumazin 等人,2011 年),而这些数据集是不完整的,也很难获得。第三,我们的方法对整个基因集进行操作,从而更接近于对整体可能性建模,而不是只对通过 targetScan 分数(Huang 等人,2007 年)或样本方差(Le 和 Bar-Joseph,2011 年)预过滤的基因子集进行操作。

2 METHODS  2 方法

2.1 Overview  2.1 概述

We describe a novel probabilistic method for miRNA target prediction problem by integrating miRNA-overexpression data and sequence-based scores from other prediction methods. Briefly, each score feature is considered an independent observed variable as input to a variational Bayesian-Gaussian mixture model (VB-GMM). We chose a Bayesian over a maximum likelihood approach to avoid overfitting. Specifically, given expression fold-change (due to miRNA transfection), we use a three-component VB-GMM to infer downregulated targets accounting for genes with little or positive fold-change [due to off-target effects (Khan et al., 2009)]. Otherwise, two-component VB-GMM is applied to unsigned sequence scores. The parameters for the VB-GMM are optimized using variational Bayesian expectation maximization algorithm. The mixture component with the largest absolute means of observed negative fold-change or sequence score is associated with miRNA targets and denoted as ‘target component’. The other components correspond to the ‘background component’. It follows that inferring miRNA-mRNA interactions is equivalent to inferring the posterior distribution of the target component given the observed variables. The targetScore is computed as the sigmoid-transformed fold-change weighted by the averaged posteriors of target components over all of the features (3).
我们介绍了一种新的概率方法,该方法通过整合 miRNA 表达数据和其他预测方法中基于序列的分数来解决 miRNA 目标预测问题。简而言之,每个得分特征都被视为一个独立的观察变量,作为变异贝叶斯-高斯混合模型(VB-GMM)的输入。我们选择贝叶斯方法而不是最大似然方法,以避免过度拟合。具体来说,在给定表达折叠变化(由于 miRNA 转染)的情况下,我们使用一个三分量 VB-GMM 来推断下调靶标,其中包括折叠变化很小或为正的基因(由于脱靶效应(Khan 等人,2009 年))。否则,两分量 VB-GMM 将应用于无符号序列得分。VB-GMM 的参数采用变异贝叶斯期望最大化算法进行优化。观测到的负折叠变化或序列得分绝对均值最大的混合物成分与 miRNA 靶标相关,称为 "靶标成分"。其他成分对应于 "背景成分"。由此可见,推断 miRNA 与 mRNA 之间的相互作用等同于推断目标成分在观测到的变量中的后验分布。targetScore 是根据所有特征的目标成分的平均后验分布加权计算出来的 sigmoid 变形折叠变化(3)。

2.2 Bayesian mixture model
2.2 贝叶斯混合模型

Assuming there are N N NN genes, we denote x = ( x 1 , , x N ) T x = x 1 , , x N T x=(x_(1),dots,x_(N))^(T)\mathbf{x}=\left(x_{1}, \ldots, x_{N}\right)^{T} as the log log log\log expression fold-change ( x f ) x f (x_(f))\left(\mathbf{x}_{f}\right) or sequence scores ( x l ) x l (x_(l))\left(\mathbf{x}_{l}\right). Thus, for L L LL sets of sequence scores, x { x f , x 1 , , x L } x x f , x 1 , , x L xin{x_(f),x_(1),dots,x_(L)}\mathbf{x} \in\left\{\mathbf{x}_{f}, \mathbf{x}_{1}, \ldots, \mathbf{x}_{L}\right\}. To simplify the following equations, we use x x x\mathbf{x} to represent one of the independent variables without loss of generality. To infer target genes for a miRNA given x x x\mathbf{x}, we need to obtain the posterior distribution p ( z x ) p ( z x ) p(z∣x)p(\mathbf{z} \mid \mathbf{x}) of the latent variable z { z 1 , , z K } z z 1 , , z K zin{z_(1),dots,z_(K)}\mathbf{z} \in\left\{z_{1}, \ldots, z_{K}\right\}, where K = 3 ( K = 2 ) K = 3 ( K = 2 ) K=3(K=2)K=3(K=2) for modeling signed (unsigned) scores such as logarithmic fold-changes (sequence scores).
假设有 N N NN 个基因,我们将 x = ( x 1 , , x N ) T x = x 1 , , x N T x=(x_(1),dots,x_(N))^(T)\mathbf{x}=\left(x_{1}, \ldots, x_{N}\right)^{T} 表示为 log log log\log 表达折叠变化 ( x f ) x f (x_(f))\left(\mathbf{x}_{f}\right) 或序列得分 ( x l ) x l (x_(l))\left(\mathbf{x}_{l}\right) 。因此,对于 L L LL 组序列得分, x { x f , x 1 , , x L } x x f , x 1 , , x L xin{x_(f),x_(1),dots,x_(L)}\mathbf{x} \in\left\{\mathbf{x}_{f}, \mathbf{x}_{1}, \ldots, \mathbf{x}_{L}\right\} 。为了简化下面的等式,我们使用 x x x\mathbf{x} 代表其中一个自变量,但不失一般性。要推断给定 x x x\mathbf{x} 的 miRNA 的靶基因,我们需要得到潜变量 z { z 1 , , z K } z z 1 , , z K zin{z_(1),dots,z_(K)}\mathbf{z} \in\left\{z_{1}, \ldots, z_{K}\right\} 的后验分布 p ( z x ) p ( z x ) p(z∣x)p(\mathbf{z} \mid \mathbf{x}) ,其中 K = 3 ( K = 2 ) K = 3 ( K = 2 ) K=3(K=2)K=3(K=2) 用于建模有符号(无符号)分数,如对数折叠变化(序列分数)。
We follow the standard Bayesian GMM based on Bishop (2006, pp. 474 482 474 482 474-482474-482 ) with only minor modifications. Although univariate GMM ( D = 1 D = 1 D=1D=1 ) is applied to each variable separately, we implemented and describe the following formalism as a more general multivariate GMM, allowing modeling the covariance matrices. Briefly, the latent variables z z z\mathbf{z} are sampled at probabilities π π pi\pi (mixing coefficient), that follow a Dirichlet prior Dir ( π α 0 ) Dir π α 0 Dir(pi∣alpha_(0))\operatorname{Dir}\left(\pi \mid \alpha_{0}\right) with hyperparameters α 0 = ( α 0 , 1 , , α 0 , K ) α 0 = α 0 , 1 , , α 0 , K alpha_(0)=(alpha_(0,1),dots,alpha_(0,K))\alpha_{0}=\left(\alpha_{0,1}, \ldots, \alpha_{0, K}\right). To account for the relative frequency of targets and non-targets for any miRNA, we set the α 0 , 1 α 0 , 1 alpha_(0,1)\alpha_{0,1} (associated with the target component) to a N a N aNa N and other α 0 , k = ( 1 a ) × N / ( K 1 ) α 0 , k = ( 1 a ) × N / ( K 1 ) alpha_(0,k)=(1-a)xx N//(K-1)\alpha_{0, k}=(1-a) \times N /(K-1), where a = 0.01 a = 0.01 a=0.01a=0.01 (by default). Assuming x x x\mathbf{x} follows a Gaussian distribution N ( x μ , Λ 1 ) N x μ , Λ 1 N(x∣mu,Lambda^(-1))\mathcal{N}\left(\mathbf{x} \mid \mu, \Lambda^{-1}\right), where Λ Λ Lambda\boldsymbol{\Lambda} (precision matrix) is the inverse covariance matrix, p ( μ , Λ ) p ( μ , Λ ) p(mu,Lambda)p(\mu, \boldsymbol{\Lambda}) together follow a Gaussian-Wishart prior k K N ( μ k m 0 , ( β 0 Λ ) 1 ) W ( Λ k W 0 , ν 0 ) k K N μ k m 0 , β 0 Λ 1 W Λ k W 0 , ν 0 prod_(k)^(K)N(mu_(k)∣m_(0),(beta_(0)Lambda)^(-1))W(Lambda_(k)∣W_(0),nu_(0))\prod_{k}^{K} \mathcal{N}\left(\mu_{k} \mid \mathbf{m}_{0},\left(\beta_{0} \boldsymbol{\Lambda}\right)^{-1}\right) \mathcal{W}\left(\Lambda_{k} \mid \mathbf{W}_{0}, \nu_{0}\right), where the hyperparameters { m 0 , β 0 , W 0 , v 0 } = { μ ^ , 1 , I D × D , D + 1 } m 0 , β 0 , W 0 , v 0 = μ ^ , 1 , I D × D , D + 1 {m_(0),beta_(0),W_(0),v_(0)}={( hat(mu)),1,I_(D xx D),D+1}\left\{\mathbf{m}_{0}, \beta_{0}, \mathbf{W}_{0}, v_{0}\right\}=\left\{\hat{\mu}, 1, \mathbf{I}_{D \times D}, D+1\right\}.
我们沿用基于 Bishop(2006 年,第 474 482 474 482 474-482474-482 页)的标准贝叶斯 GMM,只做了细微修改。虽然单变量 GMM ( D = 1 D = 1 D=1D=1 ) 是分别应用于每个变量的,但我们将下面的形式主义作为更一般的多变量 GMM 来实施和描述,允许对协方差矩阵建模。简而言之,潜变量 z z z\mathbf{z} 是以概率 π π pi\pi (混合系数)采样的,该概率遵循超参数 α 0 = ( α 0 , 1 , , α 0 , K ) α 0 = α 0 , 1 , , α 0 , K alpha_(0)=(alpha_(0,1),dots,alpha_(0,K))\alpha_{0}=\left(\alpha_{0,1}, \ldots, \alpha_{0, K}\right) 的 Dirichlet 先验 Dir ( π α 0 ) Dir π α 0 Dir(pi∣alpha_(0))\operatorname{Dir}\left(\pi \mid \alpha_{0}\right) 。为了考虑任何 miRNA 目标和非目标的相对频率,我们将 α 0 , 1 α 0 , 1 alpha_(0,1)\alpha_{0,1} (与目标成分相关)设为 a N a N aNa N ,将其他 α 0 , k = ( 1 a ) × N / ( K 1 ) α 0 , k = ( 1 a ) × N / ( K 1 ) alpha_(0,k)=(1-a)xx N//(K-1)\alpha_{0, k}=(1-a) \times N /(K-1) 设为 a = 0.01 a = 0.01 a=0.01a=0.01 (默认)。假设 x x x\mathbf{x} 遵循高斯分布 N ( x μ , Λ 1 ) N x μ , Λ 1 N(x∣mu,Lambda^(-1))\mathcal{N}\left(\mathbf{x} \mid \mu, \Lambda^{-1}\right) ,其中 Λ Λ Lambda\boldsymbol{\Lambda} (精度矩阵)是逆协方差矩阵, p ( μ , Λ ) p ( μ , Λ ) p(mu,Lambda)p(\mu, \boldsymbol{\Lambda}) 一起遵循高斯-Wishart 先验 k K N ( μ k m 0 , ( β 0 Λ ) 1 ) W ( Λ k W 0 , ν 0 ) k K N μ k m 0 , β 0 Λ 1 W Λ k W 0 , ν 0 prod_(k)^(K)N(mu_(k)∣m_(0),(beta_(0)Lambda)^(-1))W(Lambda_(k)∣W_(0),nu_(0))\prod_{k}^{K} \mathcal{N}\left(\mu_{k} \mid \mathbf{m}_{0},\left(\beta_{0} \boldsymbol{\Lambda}\right)^{-1}\right) \mathcal{W}\left(\Lambda_{k} \mid \mathbf{W}_{0}, \nu_{0}\right) ,其中超参数 { m 0 , β 0 , W 0 , v 0 } = { μ ^ , 1 , I D × D , D + 1 } m 0 , β 0 , W 0 , v 0 = μ ^ , 1 , I D × D , D + 1 {m_(0),beta_(0),W_(0),v_(0)}={( hat(mu)),1,I_(D xx D),D+1}\left\{\mathbf{m}_{0}, \beta_{0}, \mathbf{W}_{0}, v_{0}\right\}=\left\{\hat{\mu}, 1, \mathbf{I}_{D \times D}, D+1\right\}

2.3 Variational Bayesian expectation maximization
2.3 变式贝叶斯期望最大化

Let θ = { z , π , μ , Λ } θ = { z , π , μ , Λ } theta={z,pi,mu,Lambda}\theta=\{\mathbf{z}, \pi, \mu, \boldsymbol{\Lambda}\}. The marginal log likelihood can be written in terms of lower bound L ( q ) L ( q ) L(q)\mathcal{L}(q) (first term) and Kullback-Leibler divergence K L ( q p ) K L ( q p ) KL(q||p)\mathcal{K} \mathcal{L}(q \| p) (second term):
θ = { z , π , μ , Λ } θ = { z , π , μ , Λ } theta={z,pi,mu,Lambda}\theta=\{\mathbf{z}, \pi, \mu, \boldsymbol{\Lambda}\} 。边际对数似然可以用下界 L ( q ) L ( q ) L(q)\mathcal{L}(q) (第一项)和库尔贝-莱布勒发散 K L ( q p ) K L ( q p ) KL(q||p)\mathcal{K} \mathcal{L}(q \| p) (第二项)来表示:
ln p ( x ) = q ( θ ) ln p ( x , θ ) q ( θ ) + q ( θ ) ln q ( θ ) p ( θ x ) ln p ( x ) = q ( θ ) ln p ( x , θ ) q ( θ ) + q ( θ ) ln q ( θ ) p ( θ x ) ln p(x)=int q(theta)ln((p(x,theta))/(q(theta)))+int q(theta)ln((q(theta))/(p(theta∣x)))\ln p(\mathbf{x})=\int q(\theta) \ln \frac{p(\mathbf{x}, \theta)}{q(\theta)}+\int q(\theta) \ln \frac{q(\theta)}{p(\theta \mid \mathbf{x})}
where q ( θ ) q ( θ ) q(theta)q(\theta) is a proposed distribution for p ( θ x ) p ( θ x ) p(theta∣x)p(\theta \mid \mathbf{x}), which does not have a closed form distribution. Because ln p ( x ) ln p ( x ) ln p(x)\ln p(\mathbf{x}) is a constant, maximizing L ( q ) L ( q ) L(q)\mathcal{L}(q) implies minimizing K L ( q p ) K L ( q p ) KL(q||p)\mathcal{K} \mathcal{L}(q \| p). The general optimal solution ln q j ( θ j ) ln q j θ j ln q_(j)^(**)(theta_(j))\ln q_{j}^{*}\left(\theta_{j}\right) is the expectation of variable j j jj with respect to other variables, E i j [ ln p ( x , θ ) ] E i j [ ln p ( x , θ ) ] E_(i!=j)[ln p(x,theta)]\mathbb{E}_{i \neq j}[\ln p(\mathbf{x}, \theta)]. In particular, we define q ( z , π , μ , Λ ) = q ( z ) q ( π ) q ( μ , Λ ) q ( z , π , μ , Λ ) = q ( z ) q ( π ) q ( μ , Λ ) q(z,pi,mu,Lambda)=q(z)q(pi)q(mu,Lambda)q(\mathbf{z}, \pi, \mu, \boldsymbol{\Lambda})=q(\mathbf{z}) q(\pi) q(\mu, \boldsymbol{\Lambda}). The expectations for the three terms (at log log log\log scale), namely, ln q ( z ) , ln q ( π ) , ln q ( μ ) ln q ( z ) , ln q ( π ) , ln q ( μ ) ln q^(**)(z),ln q^(**)(pi),ln q^(**)(mu)\ln q^{*}(\mathbf{z}), \ln q^{*}(\pi), \ln q^{*}(\mu), have the same forms as the initial distributions due to the conjugacy of the priors. However, they require evaluation of the parameters { z , π , μ , Λ } { z , π , μ , Λ } {z,pi,mu,Lambda}\{\mathbf{z}, \pi, \mu, \boldsymbol{\Lambda}\}, which in turn all depend on the expectations of z z z\mathbf{z} or the posterior of interest:
其中, q ( θ ) q ( θ ) q(theta)q(\theta) p ( θ x ) p ( θ x ) p(theta∣x)p(\theta \mid \mathbf{x}) 的拟议分布,而 p ( θ x ) p ( θ x ) p(theta∣x)p(\theta \mid \mathbf{x}) 并没有封闭形式的分布。因为 ln p ( x ) ln p ( x ) ln p(x)\ln p(\mathbf{x}) 是一个常数,所以最大化 L ( q ) L ( q ) L(q)\mathcal{L}(q) 意味着最小化 K L ( q p ) K L ( q p ) KL(q||p)\mathcal{K} \mathcal{L}(q \| p) 。一般最优解 ln q j ( θ j ) ln q j θ j ln q_(j)^(**)(theta_(j))\ln q_{j}^{*}\left(\theta_{j}\right) 是变量 j j jj 相对于其他变量 E i j [ ln p ( x , θ ) ] E i j [ ln p ( x , θ ) ] E_(i!=j)[ln p(x,theta)]\mathbb{E}_{i \neq j}[\ln p(\mathbf{x}, \theta)] 的期望值。我们特别定义了 q ( z , π , μ , Λ ) = q ( z ) q ( π ) q ( μ , Λ ) q ( z , π , μ , Λ ) = q ( z ) q ( π ) q ( μ , Λ ) q(z,pi,mu,Lambda)=q(z)q(pi)q(mu,Lambda)q(\mathbf{z}, \pi, \mu, \boldsymbol{\Lambda})=q(\mathbf{z}) q(\pi) q(\mu, \boldsymbol{\Lambda}) 。由于先验的共轭性,三项的期望(在 log log log\log 尺度上),即 ln q ( z ) , ln q ( π ) , ln q ( μ ) ln q ( z ) , ln q ( π ) , ln q ( μ ) ln q^(**)(z),ln q^(**)(pi),ln q^(**)(mu)\ln q^{*}(\mathbf{z}), \ln q^{*}(\pi), \ln q^{*}(\mu) ,与初始分布具有相同的形式。但是,它们需要对参数 { z , π , μ , Λ } { z , π , μ , Λ } {z,pi,mu,Lambda}\{\mathbf{z}, \pi, \mu, \boldsymbol{\Lambda}\} 进行评估,而这些参数又都取决于 z z z\mathbf{z} 或相关后验的期望:
p ( z n k x n , θ ) E [ z n k ] = ρ n k j = 1 K ρ n j p z n k x n , θ E z n k = ρ n k j = 1 K ρ n j p(z_(nk)∣x_(n),theta)-=E[z_(nk)]=(rho_(nk))/(sum_(j=1)^(K)rho_(nj))p\left(z_{n k} \mid \mathbf{x}_{n}, \theta\right) \equiv \mathbb{E}\left[z_{n k}\right]=\frac{\rho_{n k}}{\sum_{j=1}^{K} \rho_{n j}}
where  其中
ln ρ n k = E [ ln π k ] + 1 2 E [ ln | Λ k | ] D 2 ln ( 2 π ) 1 2 E μ k , A k [ ( x n μ k ) T Λ k ( x n ln ρ n k = E ln π k + 1 2 E ln Λ k D 2 ln ( 2 π ) 1 2 E μ k , A k x n μ k T Λ k x n ln rho_(nk)=E[ln pi_(k)]+(1)/(2)E[ln|Lambda_(k)|]-(D)/(2)ln(2pi)-(1)/(2)E_(mu_(k),A_(k))[(x_(n)-mu_(k))^(T)Lambda_(k)(x_(n)-:}\ln \rho_{n k}=\mathbb{E}\left[\ln \pi_{k}\right]+\frac{1}{2} \mathbb{E}\left[\ln \left|\boldsymbol{\Lambda}_{k}\right|\right]-\frac{D}{2} \ln (2 \pi)-\frac{1}{2} \mathbb{E}_{\mu_{k}, \boldsymbol{A}_{k}}\left[\left(\mathbf{x}_{n}-\mu_{k}\right)^{T} \boldsymbol{\Lambda}_{k}\left(\mathbf{x}_{n}-\right.\right. μ k ) ] μ k {:mu_(k))]\left.\left.\mu_{k}\right)\right]. The inter-dependence between the expectations and model parameters falls naturally into an expectation-maximization (EM) framework, namely variational Bayesian expectation maximization. Briefly, we first initialize the model parameters based on priors and randomly sample K K KK data points μ μ mu\mu. At the i t t h i t t h i^(t^(th))i^{t^{t h}} iteration, we evaluate (2) using the model parameters (VB-E step) and update the model parameters using (2) (VB-M step). The EM iteration terminates when L ( q ) L ( q ) L(q)\mathcal{L}(q) improves by < 10 20 < 10 20 < 10^(-20)<10^{-20} (default). Please refer to Bishop (2006) for more details.
ln ρ n k = E [ ln π k ] + 1 2 E [ ln | Λ k | ] D 2 ln ( 2 π ) 1 2 E μ k , A k [ ( x n μ k ) T Λ k ( x n ln ρ n k = E ln π k + 1 2 E ln Λ k D 2 ln ( 2 π ) 1 2 E μ k , A k x n μ k T Λ k x n ln rho_(nk)=E[ln pi_(k)]+(1)/(2)E[ln|Lambda_(k)|]-(D)/(2)ln(2pi)-(1)/(2)E_(mu_(k),A_(k))[(x_(n)-mu_(k))^(T)Lambda_(k)(x_(n)-:}\ln \rho_{n k}=\mathbb{E}\left[\ln \pi_{k}\right]+\frac{1}{2} \mathbb{E}\left[\ln \left|\boldsymbol{\Lambda}_{k}\right|\right]-\frac{D}{2} \ln (2 \pi)-\frac{1}{2} \mathbb{E}_{\mu_{k}, \boldsymbol{A}_{k}}\left[\left(\mathbf{x}_{n}-\mu_{k}\right)^{T} \boldsymbol{\Lambda}_{k}\left(\mathbf{x}_{n}-\right.\right. μ k ) ] μ k {:mu_(k))]\left.\left.\mu_{k}\right)\right] 。期望值和模型参数之间的相互依存关系很自然地归入期望最大化(EM)框架,即变异贝叶斯期望最大化。简而言之,我们首先根据前验初始化模型参数,然后随机采样 K K KK 数据点 μ μ mu\mu 。在 i t t h i t t h i^(t^(th))i^{t^{t h}} 迭代中,我们使用模型参数对 (2) 进行评估(VB-E 步骤),并使用 (2) 更新模型参数(VB-M 步骤)。当 L ( q ) L ( q ) L(q)\mathcal{L}(q) 改善 < 10 20 < 10 20 < 10^(-20)<10^{-20} (默认值)时,EM迭代结束。详情请参考 Bishop (2006)。

2.4 TargetScore  2.4 目标分数

We define the targetScore as an integrative probabilistic score of a gene being the target t t tt of a miRNA:
我们将 targetScore 定义为基因成为 miRNA 靶标 t t tt 的综合概率得分:
targetScore = σ ( log F C ) ( 1 K + 1 x { x f , x 1 , , x L } p ( t x ) ) where σ ( log F C ) = 1 1 + exp ( log F C ) , p ( t x ) is the posterior in ( 2 ) .  targetScore  = σ ( log F C ) 1 K + 1 x x f , x 1 , , x L p ( t x )  where  σ ( log F C ) = 1 1 + exp ( log F C ) , p ( t x )  is the posterior in  ( 2 ) . {:[" targetScore "=sigma(-log FC)((1)/(K+1)sum_(xin{x_(f),x_(1),dots,x_(L)})p(t∣x))],[" where "sigma(-log FC)=(1)/(1+exp(log FC))","quad p(t∣x)" is the posterior in "(2).]:}\begin{array}{r} \text { targetScore }=\sigma(-\log F C)\left(\frac{1}{K+1} \sum_{\mathbf{x} \in\left\{\mathbf{x}_{f}, \mathbf{x}_{1}, \ldots, \mathbf{x}_{L}\right\}} p(t \mid \mathbf{x})\right) \\ \text { where } \sigma(-\log F C)=\frac{1}{1+\exp (\log F C)}, \quad p(t \mid \mathbf{x}) \text { is the posterior in }(2) . \end{array}

2.5 miRNA-overexpression data collection
2.5 miRNA 表达数据收集

We collected miRNA-overexpression data corresponding to 84 Gene Expression Omnibus (GEO) series, 6 platforms, 77 human cells or tissues and 113 distinct miRNAs (Supplementary Table S1). To our knowledge, this is by far the largest compendium of miRNA-overexpression data. To automate data downloading and processing, we developed a pipeline written in R, making use of the function getGEO from GEOquery R/Bioconductor package (Davis and Meltzer, 2007). For each dataset, the pipeline downloads and quantile normalizes (if necessary) the expression data and calculates (when necessary) the log fold-change (logFC) in
我们收集了 84 个基因表达总库(GEO)系列、6 个平台、77 个人类细胞或组织和 113 个不同 miRNA 的 miRNA 表达数据(补充表 S1)。据我们所知,这是迄今为止最大的 miRNA 表达数据汇编。为了实现数据下载和处理的自动化,我们利用 GEOquery R/Bioconductor 软件包(Davis 和 Meltzer,2007 年)中的 getGEO 函数,开发了一个用 R 语言编写的管道。对于每个数据集,该管道都会下载表达数据并进行量化归一化处理(如有必要),并计算(如有必要)其对折变化(logFC)。

treatment (miRNA transfected) versus (mock) control. For mRNAs interrogated by multiple probes in a single experiment, we took the average of the fold-changes. Each of the 286 data vectors contains logFC logFC logFC\operatorname{logFC} values for N 18560 N 18560 N <= 18560N \leq 18560 RefSeq mRNAs (i.e. with prefix NM for known protein coding mRNA obtained from University of California, Santa Cruz) due to transfection of one of the 113 miRNAs. For two logFC logFC logFC\operatorname{logFC} vectors associated with the same miRNA transfection (conducted in different studies), we removed mRNAs for which neither of the vectors contains a logFC logFC logFC\operatorname{logFC} value and filled the remaining missing values in one vector with the non-missing values in the other. For more than two logFC logFC logFC\operatorname{logFC} vectors for the same miRNA transfection, we removed mRNAs absent in all of those vectors and performed imputation for the remaining missing values using impute. knn from impute R package (Troyanskaya et al., 2001). Finally, we picked one representative logFC logFC logFC\operatorname{logFC} vector per miRNA, which has the highest Pearson correlation with the binary vector of the validated targets (Hsu et al., 2011) or averaged them if no validated target was available. As a result, we have 113 logFC 113 logFC 113 logFC113 \operatorname{logFC} data vectors for 113 distinct miRNA transfections.
处理(转染 miRNA)与(模拟)对照。对于在一次实验中使用多个探针检测的 mRNA,我们取折叠变化的平均值。在 286 个数据载体中,每个载体都包含因转染 113 个 miRNA 中的一个而导致的 logFC logFC logFC\operatorname{logFC} N 18560 N 18560 N <= 18560N \leq 18560 RefSeq mRNA(即前缀为 NM 的已知蛋白质编码 mRNA,从加州大学圣克鲁兹分校获得)的 logFC logFC logFC\operatorname{logFC} 值。对于与同一 miRNA 转染相关的两个 logFC logFC logFC\operatorname{logFC} 载体(在不同研究中进行),我们剔除了两个载体都不包含 logFC logFC logFC\operatorname{logFC} 值的 mRNA,并用另一个载体中的非缺失值填补了一个载体中的剩余缺失值。对于同一 miRNA 转染的两个以上 logFC logFC logFC\operatorname{logFC} 向量,我们删除了所有这些向量中缺失的 mRNA,并使用 impute R 软件包(Troyanskaya 等人,2001 年)中的 impute.最后,我们为每个 miRNA 挑选了一个具有代表性的 logFC logFC logFC\operatorname{logFC} 向量,它与验证靶标的二元向量具有最高的皮尔逊相关性(Hsu 等,2011 年),如果没有验证靶标,则取平均值。因此,我们得到了 113 个不同 miRNA 转染的 113 logFC 113 logFC 113 logFC113 \operatorname{logFC} 数据载体。

2.6 Comparison with other prediction methods
2.6 与其他预测方法的比较

We compared targetScore with seven published methods (Table 1). Among these methods, Expmicro is the only method that also uses miRNA-overexpression data. The other six methods are sequencebased, and their predictions are fine-tuned by the authors and the results provided on their Web site are the most accurate ones. Thus, we only ran Expmicro locally on the same test data and directly downloaded the target predictions by the other six programs from their corresponding Web sites. Specifically, predictions on all target sites (including conserved and non-conserved sites) from targetScan 6.1 were obtained for both context + scores (CS) and probability of conserved targeting (PCT). The latest PicTar2 target predictions were obtained from doRiNA database (database of RNA interactions in post-transcriptional regulation) (Anders et al., 2011). The miRanda predictions for sites with both good and non-good mirSVR scores were obtained at http://www.micro rna.org/. For Probability of Interaction by Target Accessibility (PITA) predictions, the accessibility energy Δ Δ G Δ Δ G Delta Delta G\Delta \Delta G on target sites flanked by 3 / 15 3 / 15 3//153 / 15 nt upstream/downstream (’ 3 / 15 3 / 15 3//153 / 15 flank’) was chosen as recommended by the authors (Kertesz et al., 2007). SVMicro predictions were obtained from http://compgenomics.utsa.edu/svmicro.html. Expmicro was run on the SVMicro scores and expression fold-changes (Liu et al., 2010a).
我们将 targetScore 与七种已发表的方法进行了比较(表 1)。在这些方法中,Expmicro 是唯一同时使用 miRNA 表达数据的方法。其他六种方法都是基于序列的,其预测结果由作者进行了微调,其网站上提供的结果是最准确的。因此,我们只在本地对相同的测试数据运行了 Expmicro,并直接从相应的网站上下载了其他六种程序的靶标预测结果。具体来说,我们从 targetScan 6.1 中获得了所有目标位点(包括保守位点和非保守位点)的上下文+分数(CS)和保守目标概率(PCT)预测结果。最新的 PicTar2 目标预测来自 doRiNA 数据库(转录后调控中的 RNA 相互作用数据库)(Anders 等人,2011 年)。具有良好和非良好 mirSVR 评分的位点的 miRanda 预测可从 http://www.micro rna.org/获得。对于目标可及性相互作用概率(PITA)预测,根据作者的建议(Kertesz 等,2007 年),选择了上游/下游侧翼为 3 / 15 3 / 15 3//153 / 15 nt 的目标位点的可及性能量 Δ Δ G Δ Δ G Delta Delta G\Delta \Delta G (" 3 / 15 3 / 15 3//153 / 15 flank")。SVMicro 预测结果来自 http://compgenomics.utsa.edu/svmicro.html。Expmicro 在 SVMicro 分数和表达折叠变化上运行(Liu 等人,2010a)。

2.7 Evaluation  2.7 评估

2.7.1 Gold standard Experimentally validated targets were downloaded from mirTarBase (Hsu et al., 2011) (http://mirtarbase.mbc.nctu. edu.tw). MirTarBase is one of the most comprehensive databases that support bulk download and agree with other databases such as TarBase (Vergoulis et al., 2012) and miRecords (Xiao et al., 2009). MirTarBase 3.5 contains 3565 validated interactions between 432 human miRNAs and 1959 target genes. The average (median) number of targets per miRNA is 8.2 (3).
2.7.1 金标准 从 mirTarBase(Hsu 等,2011 年)(http://mirtarbase.mbc.nctu. edu.tw)下载实验验证的靶标。MirTarBase 是支持批量下载的最全面的数据库之一,与 TarBase (Vergoulis et al., 2012) 和 miRecords (Xiao et al., 2009) 等其他数据库一致。MirTarBase 3.5 包含 432 个人类 miRNA 与 1959 个靶基因之间的 3565 次有效相互作用。每个 miRNA 的靶基因平均数(中位数)为 8.2 (3)。

2.7.2 Methods of comparison We first compared the performances of using logFC logFC logFC\operatorname{logFC} and the six methods that use sequence information alone (Table 1) in discriminating validated targets from the rest. To have a reliable estimate, we selected the test data corresponding to miRNAs that have at least three validated targets from mirTarBase and at least 1 predicted target from each of the predictors. Based on the results, we then picked two overall best sequence-based methods. The scores from those two methods were then integrated along with logFC logFC logFC\operatorname{logFC} into the proposed targetScore Equation (3). Next, we modified our testing set by including miRNAs that have at least one target predicted from each of the two best-performing sequence-based methods and at least 10 validated targets from mirTarBase. Finally, we evaluated the performances of log FC log FC log FC\log \mathrm{FC} and sequence-based methods alone in comparison with Expmicro (Liu et al., 2010a) and the proposed targetScore.
2.7.2 比较方法 我们首先比较了使用 logFC logFC logFC\operatorname{logFC} 和仅使用序列信息的六种方法(表 1)在区分有效靶标和其他靶标方面的表现。为了得到可靠的估计结果,我们选择了与至少有三个来自 mirTarBase 的验证靶标和至少有一个来自每个预测因子的预测靶标的 miRNA 相对应的测试数据。根据结果,我们选出了两种基于序列的总体最佳方法。然后将这两种方法的得分与 logFC logFC logFC\operatorname{logFC} 整合到拟议的 targetScore 公式 (3) 中。接下来,我们对测试集进行了修改,纳入了至少有一种靶标是由两种表现最好的基于序列的方法预测的,并且至少有 10 个靶标是由 mirTarBase 验证的 miRNA。最后,我们将 log FC log FC log FC\log \mathrm{FC} 和基于序列的方法单独与 Expmicro(Liu 等人,2010a)和建议的 targetScore 进行了比较,评估了它们的性能。

2.7.3 Systematic evaluation For each comparison, the sensitivity and specificity of the methods were systematically assessed using receiver operating characteristic (ROC) curve and summarized by the area under the curve (AUC) (Fig. 2). For a given score cutoff, the true- and false-positive rates (TPR and FPR) are estimated as the respective ratios of TPR = TP / P = TP / P =TP//P=\mathrm{TP} / \mathrm{P} and FPR = FP / N FPR = FP / N FPR=FP//N\mathrm{FPR}=\mathrm{FP} / \mathrm{N}, where TP and FP are the numbers of true and false positives, and P and N are the total numbers of positive and negative miRNA targets within the test data. Predicted targets outside of the test data were not counted. ROC is plotted by iteratively evaluating TPR ( y y yy-axis) and FPR ( x x xx-axis) while relaxing the scoring cutoff. In addition, we constructed precision-recall (PR) curve (PRC) and assessed the precision TP / ( TP + FP ) TP / ( TP + FP ) TP//(TP+FP)\mathrm{TP} /(\mathrm{TP}+\mathrm{FP}) of each method at the same recall (or TPR ). Comparing to ROC, PR is more informative in evaluating the performance at the top predictions. Both ROC and PR statistics were obtained using ROCR package (Sing et al., 2005).
2.7.3 系统评估 对于每种比较方法,均使用接收器操作特征曲线(ROC)对其灵敏度和特异性进行系统评估,并以曲线下面积(AUC)进行总结(图 2)。对于给定的得分临界值,真阳性率和假阳性率(TPR 和 FPR)分别按 TPR = TP / P = TP / P =TP//P=\mathrm{TP} / \mathrm{P} FPR = FP / N FPR = FP / N FPR=FP//N\mathrm{FPR}=\mathrm{FP} / \mathrm{N} 的比率估算,其中 TP 和 FP 是真阳性和假阳性的数量,P 和 N 是测试数据中阳性和阴性 miRNA 靶点的总数。测试数据之外的预测靶标不计算在内。ROC 是通过迭代评估 TPR( y y yy 轴)和 FPR( x x xx 轴)绘制的,同时放宽了评分临界值。此外,我们还构建了精确度-召回率(PR)曲线(PRC),并评估了每种方法在相同召回率(或 TPR)下的精确度 TP / ( TP + FP ) TP / ( TP + FP ) TP//(TP+FP)\mathrm{TP} /(\mathrm{TP}+\mathrm{FP}) 。与 ROC 相比,PR 在评估最高预测性能方面更有参考价值。ROC 和 PR 统计数据都是使用 ROCR 软件包(Sing 等人,2005 年)获得的。

2.7.4 Evaluation using proteomic data To further evaluate the performance of each method, we used the available protein output data
2.7.4 使用蛋白质组数据进行评估 为了进一步评估每种方法的性能,我们使用了现有的蛋白质输出数据
Table 1. Comparison between miRNA target prediction methods
表 1.miRNA 靶标预测方法比较
Method  方法 Seed match  种子匹配 Energy  能源 Conservation  环境保护 Context score a a ^(a)^{\mathrm{a}}  背景得分 a a ^(a)^{\mathrm{a}} SVMicro score  SVMicro 分数 P C T b P C T b P_(CT)^(b)P_{C T}{ }^{\mathrm{b}} logFC c logFC c logFC^(c)\operatorname{logFC}^{\mathrm{c}} References  参考资料
TargetScanPCT \checkmark \checkmark \checkmark Friedman et al. (2009)
弗里德曼等人(2009 年)
TargetScanCS \checkmark \checkmark \checkmark Garcia et all. (2011)
加西亚等人(2011(2011)
PicTar2 \checkmark \checkmark \checkmark Anders et al. (2011)
安德斯等人(2011 年)
miRanda \checkmark \checkmark Enright et al. (2004)
恩莱特等人(2004 年)
PITA \checkmark \checkmark Kertesz et al. (2007)
Kertesz 等人(2007 年)
SVMicro

Liu 等人(2010b) Liu 等人(2010a) Expmicro
Liu et al. (2010b)
Liu et al. (2010a)
Expmicro
Liu et al. (2010b) Liu et al. (2010a) Expmicro| Liu et al. (2010b) | | :--- | | Liu et al. (2010a) | | Expmicro |
PargetScore  目标分数 \checkmark
Method Seed match Energy Conservation Context score ^(a) SVMicro score P_(CT)^(b) logFC^(c) References TargetScanPCT ✓ ✓ ✓ Friedman et al. (2009) TargetScanCS ✓ ✓ ✓ Garcia et all. (2011) PicTar2 ✓ ✓ ✓ Anders et al. (2011) miRanda ✓ ✓ Enright et al. (2004) PITA ✓ ✓ Kertesz et al. (2007) SVMicro "Liu et al. (2010b) Liu et al. (2010a) Expmicro" PargetScore ✓ | Method | Seed match | Energy | Conservation | Context score $^{\mathrm{a}}$ | SVMicro score | $P_{C T}{ }^{\mathrm{b}}$ | $\operatorname{logFC}^{\mathrm{c}}$ | References | | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | | TargetScanPCT | $\checkmark$ | | $\checkmark$ | | | $\checkmark$ | | Friedman et al. (2009) | | TargetScanCS | $\checkmark$ | | $\checkmark$ | $\checkmark$ | | | Garcia et all. (2011) | | | PicTar2 | $\checkmark$ | $\checkmark$ | $\checkmark$ | | | | Anders et al. (2011) | | | miRanda | | $\checkmark$ | $\checkmark$ | | | Enright et al. (2004) | | | | PITA | $\checkmark$ | $\checkmark$ | | | | Kertesz et al. (2007) | | | | SVMicro | | | | | | Liu et al. (2010b) <br> Liu et al. (2010a) <br> Expmicro | | | | PargetScore | | | | | $\checkmark$ | | | |
following overexpression of hsa-miR-1, 124 and 181 (Baek et al., 2008; Liu et al., 2010a). Despite higher cost, protein abundance is a more accurate indicator of miRNA regulation than mRNA level, as miRNA is known to cause not only mRNA degradation but also translational repression. Presumably, a better prediction algorithm should rank the top targets with larger negative protein fold-changes. Thus, we selected the top 200 target predictions from selected prediction methods and plotted the cumulative sum of the protein log 2 log 2 log 2\log 2 fold-change as a function of their rankings (Liu et al., 2010a). At the same ranking, a superior method is expected to reach greater protein downregulation, indicated by a steeper curve (Fig. 3).
在过量表达 hsa-miR-1、124 和 181 后,蛋白质丰度会降低(Baek 等人,2008 年;Liu 等人,2010a)。尽管成本较高,但蛋白质丰度是比 mRNA 水平更准确的 miRNA 调控指标,因为众所周知,miRNA 不仅会导致 mRNA 降解,还会导致翻译抑制。据推测,更好的预测算法应该将负蛋白折叠变化较大的靶标排在前列。因此,我们从选定的预测方法中选出了前 200 个靶标预测结果,并绘制了蛋白质 log 2 log 2 log 2\log 2 折叠变化的累积总和与其排名的函数关系图(Liu 等人,2010a)。在排名相同的情况下,优越的方法有望达到更大的蛋白质下调,这表现为曲线更陡(图 3)。

2.7.5 GO enrichment analysis As an additional metric previously used by Huang et al. (2007), we examined whether the miRNA targets predicted by each method are biologically meaningful via Gene ontology (GO) enrichment analysis. Specifically, GO terms in biological processes (BP) (GO-BP) were downloaded using getBM function from R package biomaRt (Durinck et al., 2009), where GO terms with fewer than five genes or with evidence codes equal to Inferred from Electronic Annotation (IEA), Non-traceable Author Statement (NAS) or No biological Data available (ND) were discarded, giving 1717 GO-BP terms and 10222 unique genes. Based on ROC analysis (Section 2.7.3), we chose for each method a specific scoring cutoff that resulted in FPR < 0.3 < 0.3 < 0.3<0.3 and used such cutoff to select genes with equal or higher scores. The resulting list of Ensembl gene IDs for each method was then subjected to hypergeometric enrichment test for each GO-BP term using R built-in function phyper. The corrected P P PP-values or false discovery rate (FDR) with R function p . adjust was converted to enrichment scores as log 10 ( FDR ) log 10 ( FDR ) -log_(10)(FDR)-\log _{10}(\mathrm{FDR}). Cumulative density plot was constructed as a function of the enrichment scores (Fig. 4B).
2.7.5 GO 富集分析 作为 Huang 等人(2007 年)曾使用过的附加指标,我们通过基因本体(Gene ontology,GO)富集分析来检验每种方法预测的 miRNA 靶标是否具有生物学意义。具体来说,我们使用 R 软件包 biomaRt(Durinck 等人,2009 年)中的 getBM 函数下载了生物过程(BP)中的 GO 术语(GO-BP),其中剔除了少于 5 个基因或证据代码为 Inferred from Electronic Annotation (IEA)、Non-traceable Author Statement (NAS) 或 No biological Data available (ND) 的 GO 术语,从而得到了 1717 个 GO-BP 术语和 10222 个唯一基因。根据 ROC 分析(第 2.7.3 节),我们为每种方法选择了一个特定的评分临界值,即 FPR < 0.3 < 0.3 < 0.3<0.3 ,并使用该临界值选择得分相同或更高的基因。然后使用 R 内置函数 phyper 对每种方法得到的 Ensembl 基因 ID 列表中的每个 GO-BP 项进行超几何富集检验。使用 R 函数 p .adjust 将校正后的 P P PP 值或误诊率(FDR)转换为富集得分 log 10 ( FDR ) log 10 ( FDR ) -log_(10)(FDR)-\log _{10}(\mathrm{FDR}) 。根据富集分构建累积密度图(图 4B)。

3 RESULTS  3 结果

3.1 Constructing sequence-based predictors
3.1 构建基于序列的预测器

Among the 113 overexpressed miRNAs, 11 of them have at least 3 validated and 1 predicted target from the 6 sequence-based methods (Table 1). Using these test data, we established that the best-overall sequence-based methods are targetScanCS and targetScanPCT in terms of the AUCs of ROC and PR (Supplementary Fig. S1). Accordingly, we constructed the proposed targetScore by integrating logFC and sequence scores from targetScanCS and targetScanPCT into Equation (3). Modifying the test cases by including all of the overexpressed miRNAs that have at least one predicted targets in targetScanCS, targetScanPCT and SVMicro (required as input to Expmicro), we obtained 35 miRNAs for subsequent comparisons.
在 113 个过表达的 miRNA 中,有 11 个至少有 3 个通过 6 种基于序列的方法验证的靶点和 1 个预测的靶点(表 1)。利用这些测试数据,我们确定从 ROC 和 PR 的 AUC 值来看,基于序列的方法中最好的是 targetScanCS 和 targetScanPCT(补充图 S1)。因此,我们将 targetScanCS 和 targetScanPCT 的 logFC 和序列得分整合到公式(3)中,构建了建议的 targetScore。我们将所有在 targetScanCS、targetScanPCT 和 SVMicro(作为 Expmicro 的输入)中至少有一个预测靶点的过表达 miRNA 都纳入测试病例,从而得到了 35 个 miRNA 用于后续比较。

3.2 TargetScore  3.2 目标分数

Figure 1 illustrates the intuition behind targetScore. The top 10 interactions in terms of logFC logFC logFC\operatorname{logFC}, targetScanCS or PCT from the real data were plotted for the validated and unvalidated targets. Rectangle size corresponds to the magnitude of the scores
图 1 展示了 targetScore 背后的直觉。根据真实数据中的 logFC logFC logFC\operatorname{logFC} 、targetScanCS 或 PCT,绘制了已验证和未验证目标的前 10 个交互。矩形大小与得分大小相对应

Fig. 1. Hinton plot (Rumelhart and Hintont, 1986) of target feature scores. Top 10 interactions in terms of logFC logFC logFC\operatorname{logFC}, targetScanCS or PCT for validated and unvalidated targets are displayed. Gray/white indicates + / + / +//-+/- sign. Rectangle size is proportional to the feature size rescaled to [ 0 , ± 1 ] [ 0 , ± 1 ] [0,+-1][0, \pm 1]
图 1.目标特征得分的 Hinton 图(Rumelhart 和 Hintont,1986 年)。图中显示了已验证和未验证目标的 logFC logFC logFC\operatorname{logFC} 、targetScanCS 或 PCT 的前 10 个交互作用。灰色/白色表示 + / + / +//-+/- 符号。矩形大小与特征大小成正比,并按 [ 0 , ± 1 ] [ 0 , ± 1 ] [0,+-1][0, \pm 1] 比例重新调整。

rescaled to [ 0 , ± 1 ] [ 0 , ± 1 ] [0,+-1][0, \pm 1]. Intuitively, we observe higher targetScores as the three features scores (logFC, targetScanCS and PCT) become more negative. In particular, the superior performance of targetScore is more attributed to the log F C log F C log FC\log F C feature than the other two in silico predictors; this is consistent with our hypothesis that miRNA-overexpression data is valuable in miRNA target prediction. Among the validated targets (left portion of the plot), for instance, the top 10 logFC 10 logFC 10 logFC10 \operatorname{logFC} attribute to higher targetScore than the top 10 scores from targetScanCS/PCT. On the other hand, targetScanCS and PCT complement logFC logFC logFC\operatorname{logFC}, as some (un)validated targets have (high) low log FC log FC log FC\log \mathrm{FC} but (low) high targetScanCS/PCT scores. Thus, targetScore is a more robust predictor by integrating the three scores.
重新调整为 [ 0 , ± 1 ] [ 0 , ± 1 ] [0,+-1][0, \pm 1] 。直观地说,随着三个特征得分(logFC、targetScanCS 和 PCT)变得更负,我们观察到 targetScore 越高。特别是,targetScore 的优越性能更多地归功于 log F C log F C log FC\log F C 特征,而不是其他两个硅学预测因子;这与我们的假设一致,即 miRNA 表达数据对 miRNA 靶标预测很有价值。例如,在已验证的靶标中(图的左侧部分),前 10 logFC 10 logFC 10 logFC10 \operatorname{logFC} 的 targetScore 比 targetScanCS/PCT 的前 10 个得分高。另一方面,targetScanCS 和 PCT 与 logFC logFC logFC\operatorname{logFC} 互为补充,因为一些(未经)验证的目标具有(高)低 log FC log FC log FC\log \mathrm{FC} 但(低)高的 targetScanCS/PCT 分数。因此,通过整合这三个分数,targetScore 是一个更稳健的预测指标。

3.3 Comparisons of target prediction methods
3.3 目标预测方法比较

Figure 2 and Supplementary Figures S2 and S3 show that our targetScore method outperforms other existing methods. Specifically, targetScore achieves significantly higher AUC of both ROC and PR than those from other methods (except for the PR-AUC from targetScanPCT) ( P < 0.01 ( P < 0.01 (P < 0.01(P<0.01; one-sided Wilcoxon signed-rank test; Fig. 2C). Additionally, targetScore dominates the largest number of tests among the 35 miRNAs: it has the best ROC and PRC for 22 and 17 miRNAs, respectively (Fig. 2D). It is also worth noting that our method had a large leading margin ahead of the second best method in several tests (e.g. hsa-miR-20a/192/16 in ROC and hsa-miR-1/124 in PRC; right panels from Fig. 2A and B). Thus, targetScore is able to outperform the best individual method by integrating the complementary information generated from each method.
图 2 以及补充图 S2 和 S3 显示,我们的 targetScore 方法优于其他现有方法。具体来说,targetScore 的 ROC 和 PR 的 AUC 都明显高于其他方法(targetScanPCT 的 PR-AUC 除外) ( P < 0.01 ( P < 0.01 (P < 0.01(P<0.01 ;单侧 Wilcoxon 符号秩检验;图 2C)。此外,在 35 个 miRNA 中,targetScore 主导了最多的测试:它分别对 22 个和 17 个 miRNA 具有最佳的 ROC 和 PRC(图 2D)。值得注意的是,我们的方法在一些测试中领先于第二好的方法(如 ROC 中的 hsa-miR-20a/192/16,PRC 中的 hsa-miR-1/124;图 2A 和 B 右侧面板)。因此,targetScore 通过整合每种方法产生的互补信息,能够超越最佳的单个方法。

3.4 Protein downregulation
3.4 蛋白质下调

Because of the potential translational repression by miRNA, negative logFC logFC logFC\operatorname{logFC} of protein outputs due to transfection of hsa- miR 1 / 124 / 181 miR 1 / 124 / 181 miR-1//124//181\mathrm{miR}-1 / 124 / 181 is a more direct indicator of the true miRNA targets than mRNA expression. Targets ranked by targetScore exhibit comparable negative cumulative logFC logFC logFC\operatorname{logFC} comparing with the best among other methods (Fig. 3). Thus, the protein outputs are mostly consistent with the above ROC/PR statistics, indicating that the confidence target predictions from targetScore at the mRNA level translate well to the miRNA effects at the protein level.
由于 miRNA 潜在的翻译抑制作用,转染 hsa- miR 1 / 124 / 181 miR 1 / 124 / 181 miR-1//124//181\mathrm{miR}-1 / 124 / 181 后蛋白质输出的负 logFC logFC logFC\operatorname{logFC} 比 mRNA 表达更能直接反映真正的 miRNA 靶标。用 targetScore 排列的靶标与其他方法中最好的靶标相比,呈现出相似的负累积 logFC logFC logFC\operatorname{logFC} (图 3)。因此,蛋白质的输出结果与上述 ROC/PR 统计结果基本一致,表明 targetScore 在 mRNA 水平上的可信靶点预测能很好地转化为 miRNA 在蛋白质水平上的效应。

3.5 GO enrichment of predicted targets
3.5 预测目标的 GO 富集

The confidence targets filtered by targetScore (at FPR < 0.3 < 0.3 < 0.3<0.3 ) are enriched for more significant GO terms at FDR < 0.05 FDR < 0.05 FDR < 0.05\mathrm{FDR}<0.05 than other methods in 19 of the 35 tests (Fig. 4A and Supplementary Table S2). Overall, targetScore recovers significantly higher number of GO terms than other methods ( P < 1 e 6 ) ( P < 1 e 6 ) (P < 1e-6)(P<1 \mathrm{e}-6) except for Expmicro ( P < 0.14 ( P < 0.14 (P < 0.14(P<0.14; one-sided Wilcoxon signed-rank test; Fig. 4A inset table). Moreover, GO terms identified from targetScore targets (targetScore-GO) exhibit significantly higher cumulative enrichment scores than those from Expmicro in nine tests and those from log FC log FC log FC\log \mathrm{FC}, targetScanCS/PCT in majority of the tests as indicated by the gray boxes in Figure 4 B ( P < 0.05 4 B ( P < 0.05 4B(P < 0.054 \mathrm{~B}(P<0.05, one-sided paired Kolmogorov-Smirnov (KS) test; see Supplementary Fig. S4 for details). In contrast, Expmicro-GO is significantly more enriched than targetScore-GO in only four
在 35 次测试中的 19 次测试中,targetScore 筛选出的置信目标(FPR < 0.3 < 0.3 < 0.3<0.3 时)在 FDR < 0.05 FDR < 0.05 FDR < 0.05\mathrm{FDR}<0.05 处比其他方法富集了更多重要的 GO 术语(图 4A 和补充表 S2)。总体而言,除 Expmicro ( P < 0.14 ( P < 0.14 (P < 0.14(P<0.14 外,targetScore 恢复的 GO 术语数量明显高于其他方法 ( P < 1 e 6 ) ( P < 1 e 6 ) (P < 1e-6)(P<1 \mathrm{e}-6) ;单侧 Wilcoxon 符号秩检验;图 4A 插页表)。此外,从 targetScore 靶标(targetScore-GO)中鉴定出的 GO 术语在 9 个测试中的累积富集得分显著高于 Expmicro 中的富集得分,在大多数测试中的累积富集得分显著高于 log FC log FC log FC\log \mathrm{FC} 、 targetScanCS/PCT 中的富集得分,如图 4 B ( P < 0.05 4 B ( P < 0.05 4B(P < 0.054 \mathrm{~B}(P<0.05 中灰色方框所示,单侧配对 Kolmogorov-Smirnov (KS) 检验;详见补充图 S4)。相比之下,Expmicro-GO 仅在以下四个方面的富集程度明显高于 targetScore-GO
A ROC  平机会
B Precision-recall  B 精确调用
C Paired test on AUC of TargetScore vs other method
C TargetScore 与其他方法 AUC 的配对测试
logFC targetScanCS targetScanPCT expmicro
ROC 1.3 e 09 1.3 e 09 1.3e-091.3 \mathrm{e}-09 8.7 e 04 8.7 e 04 8.7e-048.7 \mathrm{e}-04 1.3 e 03 1.3 e 03 1.3e-031.3 \mathrm{e}-03 5.5 e 10 5.5 e 10 5.5e-105.5 \mathrm{e}-10
PRC 2.5 e 07 2.5 e 07 2.5e-072.5 \mathrm{e}-07 3.3 e 03 3.3 e 03 3.3e-033.3 \mathrm{e}-03 1.8 e 01 1.8 e 01 1.8e-011.8 \mathrm{e}-01 1.9 e 08 1.9 e 08 1.9e-081.9 \mathrm{e}-08
logFC targetScanCS targetScanPCT expmicro ROC 1.3e-09 8.7e-04 1.3e-03 5.5e-10 PRC 2.5e-07 3.3e-03 1.8e-01 1.9e-08| | logFC | targetScanCS | targetScanPCT | expmicro | | :---: | :---: | :---: | :---: | :---: | | ROC | $1.3 \mathrm{e}-09$ | $8.7 \mathrm{e}-04$ | $1.3 \mathrm{e}-03$ | $5.5 \mathrm{e}-10$ | | PRC | $2.5 \mathrm{e}-07$ | $3.3 \mathrm{e}-03$ | $1.8 \mathrm{e}-01$ | $1.9 \mathrm{e}-08$ |

D Tests for which method X has the best performance
D 测试哪种方法 X 性能最好
logFC targetScanCS targetScanPCT expmicro targetScore  目标分数
ROC 1 5 6 1 22
PRC 2 3 13 0 17
logFC targetScanCS targetScanPCT expmicro targetScore ROC 1 5 6 1 22 PRC 2 3 13 0 17| | logFC | targetScanCS | targetScanPCT | expmicro | targetScore | | :---: | :---: | :---: | :---: | :---: | :---: | | ROC | 1 | 5 | 6 | 1 | 22 | | PRC | 2 | 3 | 13 | 0 | 17 |
Fig. 2. Evaluation of target prediction methods on 35 miRNAs. Based on the experimentally validated targets (Hsu et al., 2011), ROC (A) and PR (B) curves were constructed using scores from logFC, targetScanCS, targetScanPCT, expmicro and targetScore. The left panels display the AUC of ROC/PR for each method across the 35 miRNAs, which were ordered by the decreasing differences between targetScore-AUC and AUC from the best-performing competitor. The miRNAs in red indicate that targetScore is best among all. The specific ROC/PR curves for the top and bottom three miRNAs as highlighted in the orange boxes were illustrated on the right panels. © P P PP-values from one-sided Wilcoxon signed-rank test by comparing the AUCs for ROC/PR from targetScore with the AUCs from other methods. (D) The number of miRNAs for which a particular prediction method has the best performance in ROC/PR
图 2.在 35 个 miRNA 上评估靶标预测方法。根据实验验证的靶标(Hsu 等,2011 年),利用 logFC、targetScanCS、targetScanPCT、expmicro 和 targetScore 的得分构建了 ROC(A)和 PR(B)曲线。左侧面板显示了 35 种 miRNA 中每种方法的 ROC/PR 的 AUC,按照 targetScore-AUC 与表现最好的竞争者的 AUC 之间的差异递减排序。红色的 miRNA 表示 targetScore 在所有 miRNA 中表现最好。右侧面板显示了橙色方框中突出显示的前三名和后三名 miRNA 的具体 ROC/PR 曲线。(C) 通过比较 targetScore 的 ROC/PR AUC 与其他方法的 AUC,得出的单侧 Wilcoxon 符号秩检验 P P PP 值。(D) 特定预测方法在 ROC/PR 方面表现最佳的 miRNA 数量

Fig. 3. Cumulative sum of protein downregulation as a function of the top 200 rankings of target predictions for hsa-miR-1, 124 and 181. For each comparison method, cumulative sum of log 2 log 2 log 2\log 2 fold-change of protein outputs measured in miRNA transfection experiments (Baek et al., 2008) is plotted as a function of the top 200 rankings. At the same rank index, a superior method is expected to reach greater cumulative sum of protein down-fold, indicated by the steeper curve
图 3.蛋白质下调的累积总和与 hsa-miR-1、124 和 181 目标预测的前 200 个排名的函数关系。对于每种比较方法,在 miRNA 转染实验(Baek 等人,2008 年)中测得的蛋白质输出 log 2 log 2 log 2\log 2 倍变的累积总和与前 200 个排名的函数关系如图所示。在相同的排名指数下,优越的方法预计会达到更大的蛋白质折减累积总和,这表现为曲线更陡峭
A
B
Fig. 4. Comparison of GO enrichments. For each comparison method, genes with equal or higher score than a cutoff in which FPR < 0.3 < 0.3 < 0.3<0.3 were subjected to GO enrichment tests. (A) At FDR < 0.05 < 0.05 < 0.05<0.05, we counted the number of significant GO terms for each method across the 35 miRNAs. The miRNAs in red indicate that the number of GO terms from targetScore is the highest among all. (Inset) the P P PP-values indicate whether the corresponding numbers of GO terms from targetScore are significantly higher than those from other methods based on one-sided Wilcoxon signed-rank test. (B) The (white) gray boxes indicate that the cumulative GO enrichment scores from targetScore are (not) significantly higher than those from competitors based on one-sided paired KS-test at P < 0.05 P < 0.05 P < 0.05P<0.05. The columns are in the same order as the miRNAs in panel A. The cumulative density function (CDF) plot for the first and last column (i.e. hsa-miR-106b and 27a) was illustrated in the bottom panel. At the same density ( y y yy-axis), the enrichment scores ( x x xx-axis) are the highest if the corresponding CDF is on the right side of other CDFs. All of the CDF plots and detailed P P PP-values are presented in Supplementary Figure S 4
图 4.GO 富集度比较。对于每种比较方法,对 FDR < 0.3 < 0.3 < 0.3<0.3 时得分等于或高于临界值的基因进行 GO 富集测试。(A)在 FDR < 0.05 < 0.05 < 0.05<0.05 时,我们统计了每种方法在 35 个 miRNA 中显著的 GO 项的数量。红色的 miRNA 表示从 targetScore 中获得的 GO 术语数量是所有 miRNA 中最高的。(插图) P P PP 值表示,根据单侧 Wilcoxon 符号秩检验,targetScore 得出的相应 GO 术语数是否显著高于其他方法得出的 GO 术语数。(B)灰色方框(白色)表示,根据 P < 0.05 P < 0.05 P < 0.05P<0.05 的单侧配对 KS 检验,targetScore 的累积 GO 富集得分(不)明显高于竞争对手的累积 GO 富集得分。第一列和最后一列(即 hsa-miR-106b 和 27a)的累积密度函数(CDF)图见下图。在相同密度下( y y yy 轴),如果相应的 CDF 位于其他 CDF 的右侧,则富集分数( x x xx 轴)最高。所有 CDF 图和详细的 P P PP 值见补充图 S 4。

tests. Thus, targetScore-predicted targets are more likely enriched for meaningful BP.
测试。因此,targetScore 预测的目标更有可能富含有意义的 BP。

3.6 Oncomir and oncogene targets network
3.6 Oncomir 和癌基因靶标网络

Encouraged by the aforementioned results, we analyzed the target relationships between oncomirs (Croce, 2009; Spizzo et al., 2009) and oncogenes downloaded from COSMIC (Forbes et al., 2011) (Supplementary Table S3). Specifically, we identified 207 oncomir-oncogene interactions, involving 26 oncomirs (yellow node) and 113 oncogenes (white node) (Fig. 5). Of these interactions, 166 are validated (red solid edges) and the remaining 41 are predicted with targetScore 0.6 0.6 >= 0.6\geq 0.6 (blue dash edges). The scoring cutoff was chosen based on the differential targetScore distributions for the validated and non-validated targets (Supplementary Fig. S5). The constructed oncomir-oncogene network is highly connected. In particular, the top three oncomirs with the highest out-degree, namely, hsa-miR-155, 16 and 373 target 27, 19 and 18 oncogenes, respectively. Interestingly, 12 and all of the 19 targets of hsa-miR-155 and 16 are, respectively, validated; in contrast, 17 of the 19 targets of hsa-miR-373 are predicted with high confidence, yet call for experimental validation.
受上述结果的鼓舞,我们分析了从 COSMIC(Forbes 等人,2011 年)下载的 oncomirs(Croce,2009 年;Spizzo 等人,2009 年)与癌基因之间的靶标关系(补充表 S3)。具体来说,我们发现了 207 种肿瘤病毒与癌基因的相互作用,涉及 26 种肿瘤病毒(黄色节点)和 113 种癌基因(白色节点)(图 5)。在这些相互作用中,166 种是经过验证的(红色实线边缘),其余 41 种是用 targetScore 0.6 0.6 >= 0.6\geq 0.6 预测的(蓝色虚线边缘)。评分临界值是根据已验证和未验证靶标的 targetScore 分布差异选择的(补充图 S5)。构建的oncomir-癌基因网络具有高度连接性。特别是,外度最高的前三个oncomir,即 hsa-miR-155、16 和 373,分别靶向 27、19 和 18 个癌基因。有趣的是,hsa-miR-155 和 16 的 19 个靶点中,分别有 12 个和全部靶点已被验证;相反,hsa-miR-373 的 19 个靶点中,有 17 个靶点的预测置信度很高,但仍需要实验验证。

4 DISCUSSION  4 讨论

Most of recently developed miRNA target prediction methods exhibit a paradigm shift from rule-based binary classification to a more context-dependent, quantitative and probabilistic approach (Huang et al., 2007; Le and Bar-Joseph, 2011). The momentum of this shift is largely facilitated by the increasing amount of expression profiling data of mRNAs (and miRNAs) across various experimental conditions. However, most of these expres-sion-based methods are based on correlation and thus require a
最近开发的大多数 miRNA 靶标预测方法都呈现出从基于规则的二元分类向更依赖于上下文、定量和概率方法的范式转变(Huang 等人,2007 年;Le 和 Bar-Joseph,2011 年)。这种转变的势头主要得益于各种实验条件下 mRNA(和 miRNA)表达谱数据量的不断增加。然而,这些基于表达的方法大多基于相关性,因此需要一个

large set of expression profiles of mRNAs and miRNAs across various tissues, cell lines or patients, which limit their applications to only a general survey of the robust miRNA targets rather than condition-specific miRNA targets. On the other hand, expression profiling following specific miRNA transfection (knockin) provides the most direct clue to identify in vivo functional miRNA targets. However, expression data measured by either microarrays or RNA-seq are noisy, and it is known that changes in expression can be caused by indirect regulatory effect by miRNAs (Khan et al., 2009), which is not easily distinguishable from direct effects without the aid of sequence-based information. To our knowledge, few programs are specifically developed for transfection-based miRNA target prediction. Among them, Sylamer is the earliest work developed to identify enriched k -mer motifs, which are the seed regions among the top ranked targets based on P P PP-values obtained from t t tt-test (van Dongen et al., 2008). Thus, the method does not model the distribution of fold-changes or sequence features. Instead, Sylamer is mainly designed to visually inspect the enrichment of k-mer patterns rather than predicting specific targets.
这些方法的应用范围有限,只能对强大的 miRNA 靶点进行一般性调查,而不能确定特定条件下的 miRNA 靶点。另一方面,特定 miRNA 转染(基因敲除)后的表达谱分析为确定体内功能性 miRNA 靶点提供了最直接的线索。然而,微阵列或 RNA-seq 测量的表达数据都有噪声,而且众所周知,表达的变化可能是由 miRNA 的间接调控效应引起的(Khan 等人,2009 年),如果没有基于序列的信息辅助,这种效应与直接效应不易区分。据我们所知,很少有程序是专门为基于转染的 miRNA 靶标预测而开发的。其中,Sylamer 是最早开发的用于识别富集 k -mer motifs 的软件,这些 k -mer motifs 是根据 P P PP 测试得到的 t t tt 值在排名靠前的靶标中找到的种子区域(van Dongen 等人,2008 年)。因此,该方法并不模拟折叠变化或序列特征的分布。相反,Sylamer 主要是为了直观地检查 k-mer 模式的富集情况,而不是预测具体的靶标。
In this article, we introduce targetScore, a Bayesian probabilistic scoring method taking into account both the fold-change due to miRNA overexpression and sequence-based information. The proposed method is similar to Expmicro (Liu et al., 2010a). However, Expmicro requires training data to calculate the sequence-based scores using SVMicro and used only a twocomponent GMM to model the fold-changes. In contrast, targetScore models the distributions of multiple sets of precomputed or user-supplied sequence-based scores and fold-changes using respective two- and three-component VB-GMM.
本文介绍了一种贝叶斯概率评分方法 targetScore,它同时考虑了 miRNA 过表达引起的折叠变化和基于序列的信息。该方法类似于 Expmicro(Liu 等人,2010a)。然而,Expmicro 需要使用 SVMicro 计算基于序列的分数的训练数据,并且只使用了双分量 GMM 来模拟折叠变化。相比之下,targetScore 使用各自的双分量和三分量 VB-GMM 对多组预先计算或用户提供的基于序列的分数和折叠变化的分布进行建模。
We complied (to our knowledge) the largest set of overexpression data compendium and demonstrated the utilities of targetScore in extensive tests. TargetScore achieved superior ROC and PR performances (Fig. 2 and Supplementary Figs S2
据我们所知,我们收集了最大的过表达数据集,并在大量测试中证明了 targetScore 的实用性。TargetScore 实现了卓越的 ROC 和 PR 性能(图 2 和补充图 S2

Fig. 5. Oncomir-oncogene network. The network drawn by Cytoscape 3 (Shannon et al., 2003) comprises 207 oncomir-oncogene interactions, where 166 (41) are validated (predicted with targetScore 0.6 0.6 >= 0.6\geq 0.6 ), involving 26 oncomirs (yellow) and 113 oncogenes (white). The red solid and blue dash edges are the validated and predicted interactions, respectively. The size of the node is proportional to the connection degree. Edge widths are proportional to the targetScore
图 5.Oncomir-oncogene 网络。由 Cytoscape 3(Shannon 等人,2003 年)绘制的网络包括 207 个肿瘤基因与癌基因的相互作用,其中 166 个(41)是经过验证的(用 targetScore 0.6 0.6 >= 0.6\geq 0.6 预测),涉及 26 个肿瘤基因(黄色)和 113 个癌基因(白色)。红色实线和蓝色虚线分别是验证的和预测的相互作用。节点的大小与连接度成正比。边缘宽度与 targetScore 成比例

and S3). The cumulative protein outputs for miR 1 / 124 / 181 miR 1 / 124 / 181 miR-1//124//181\mathrm{miR}-1 / 124 / 181 as a function of the top 200 rankings from targetScore reveals the (second) deepest protein downregulation comparing with the top targets from other methods (Fig. 3). Thus, the relative overall trends at the protein level are consistent with the predictions at the mRNA level, where targetScore dominates most of the tests. Moreover, targetScore targets are more enriched for meaningful BP when compared with targets predicted by other methods (Fig. 4 and Supplementary Fig. S4). Using targetScore, we constructed an oncomir-oncogene regulatory network (Fig. 5) and hypothesized that our predicted oncomir-oncogene interactions may provide further insights into cancer network.
和 S3)。 miR 1 / 124 / 181 miR 1 / 124 / 181 miR-1//124//181\mathrm{miR}-1 / 124 / 181 的累积蛋白质输出量与 targetScore 的前 200 个排名的函数关系显示,与其他方法的顶级靶标相比, miR 1 / 124 / 181 miR 1 / 124 / 181 miR-1//124//181\mathrm{miR}-1 / 124 / 181 的蛋白质下调程度(第二)最深(图 3)。因此,蛋白质水平的相对总体趋势与 mRNA 水平的预测是一致的,在 mRNA 水平上 targetScore 主导了大多数测试。此外,与其他方法预测的靶标相比,targetScore 的靶标更富含有意义的 BP(图 4 和补充图 S4)。利用 targetScore,我们构建了一个 oncomir-癌基因调控网络(图 5),并假设我们预测的 oncomir-癌基因相互作用可能为癌症网络提供进一步的见解。
The success of our method underlines the importance of integrating (preferably independent) informative predictors into a unified framework. On the other hand, the absolute PR performances are generally poor among all tested methods perhaps because of the limited number of validated targets and/or the
我们方法的成功强调了将(最好是独立的)信息预测指标整合到统一框架中的重要性。另一方面,在所有测试方法中,PR 的绝对性能普遍较差,这可能是因为验证目标的数量有限和/或

limitations of the methods. Although targetScore compares favorably with the published methods, there are a few issues that remain to be addressed in future work. Several studies have shown that the expression fold-change also depends on the initial abundance and the natural decay or turnover rate of the corresponding mRNA species (Larsson et al., 2010). Presumably, a future target prediction algorithm will benefit from the use of this information in a cell-specific context. Additionally, targetScore assumes that the mRNA targets are independent of each other. However, Arvey et al. (2010) have shown that miRNAs that have a higher number of available target transcripts will downregulate each individual target gene to a lesser extent than those with a lower number of targets. Although the context score from targetScanCS has incorporated the static target abundance information as the total number of target sites that can be recognized by the same miRNA, it would be more realistic to consider the underlying expression level of the co-targeted mRNA to infer the actual available target sites.
方法的局限性。尽管 targetScore 与已发表的方法相比效果较好,但仍有一些问题需要在今后的工作中加以解决。一些研究表明,表达折叠变化也取决于相应 mRNA 物种的初始丰度和自然衰减或周转率(Larsson 等人,2010 年)。据推测,未来的靶标预测算法将受益于细胞特异性背景下这一信息的使用。此外,targetScore 假定 mRNA 靶标是相互独立的。然而,Arvey 等人(2010 年)的研究表明,与靶标数量较少的 miRNA 相比,靶标转录本数量较多的 miRNA 对每个靶基因的下调程度较小。虽然 targetScanCS 的上下文得分已将静态靶丰度信息纳入同一 miRNA 可识别的靶点总数,但考虑共靶 mRNA 的基本表达水平来推断实际可用的靶点会更切合实际。

ACKNOWLEDGEMENTS  致谢

The authors thank Quaid Morris for useful discussion, and Yufei Huang and Hui Liu for the helps on running Expmicro. This paper is dedicated to the memory of SZ.
作者感谢奎德-莫里斯(Quaid Morris)的有益讨论,感谢黄宇飞和刘辉在运行 Expmicro 时提供的帮助。本文谨以此纪念 SZ。
Funding: Natural Sciences and Engineering Research Council (NSERC) Canada Graduate Scholarship (to Y.L.) and Ontario Research Fund-Global Leader (Round 2) and an NSERC grant (to Z.Z.).
资助:自然科学和工程研究理事会(NSERC)加拿大研究生奖学金(Y.L.)和安大略研究基金-全球领导者(第二轮)以及自然科学和工程研究理事会补助金(Z.Z.)。
Conflict of Interest: None declared.
利益冲突:未声明。

REFERENCES  参考文献

Alexiou,P. et al. (2009) Lost in translation: an assessment and perspective for computational microRNA target identification. Bioinformatics, 25, 3049-3055.
Alexiou,P. et al. (2009) Lost in translation: an assessment and perspective for computational microRNA target identification.生物信息学》,25,3049-3055。

Anders,G. et al. (2011) doRiNA: a database of RNA interactions in posttranscriptional regulation. Nucleic Acids Res., 40, D180-D186.
Anders,G. et al. (2011) doRiNA: a database of RNA interactions in posttranscriptional regulation.核酸研究》,40,D180-D186。

Arvey,A. et al. (2010) Target mRNA abundance dilutes microRNA and siRNA activity. Mol. Syst. Biol., 6, 1-7.
Arvey,A. 等人 (2010) 目标 mRNA 丰度会稀释 microRNA 和 siRNA 的活性。Mol.Syst.Biol., 6, 1-7.

Baek,D. et al. (2008) The impact of microRNAs on protein output. Nature, 455, 64-71.
Baek,D. et al. (2008) The impact of microRNAs on protein output.自然》,455,64-71。

Bartel,D.P. (2009) MicroRNAs: target recognition and regulatory functions. Cell, 136, 215-233.
Bartel,D.P. (2009) MicroRNAs:目标识别和调控功能。细胞》,136,215-233。

Bishop,C.M. (2006) Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, New York.
Bishop,C.M. (2006) Pattern Recognition and Machine Learning (Information Science and Statistics)。纽约 Springer 出版社。

Croce,C.M. (2009) Causes and consequences of microRNA dysregulation in cancer. Nat. Rev. Genet., 10, 704-714.
Croce,C.M.(2009)癌症中微小 RNA 失调的原因和后果。Nat.Rev.Genet.,10,704-714。

Davis,S. and Meltzer,P.S. (2007) GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics, 23, 1846-1847.
Davis,S. and Meltzer,P.S. (2007) GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor.生物信息学》,23,1846-1847。

Durinck,S. et al. (2009) Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc., 4, 1184-1191.
Durinck,S. et al. (2009) Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat.4,1184-1191。

Enright,A.J. et al. (2004) MicroRNA targets in Drosophila. Genome Biol., 5, R1-R14.
Enright,A.J. 等(2004 年)果蝇的 MicroRNA 靶标。基因组生物学》,5,R1-R14。

Forbes,S.A. et al. (2011) COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res., 39, D945-D950.
Forbes,S.A. et al. (2011) COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer.核酸研究》,39,D945-D950。

Friedman,R.C. et al. (2009) Most mammalian mRNAs are conserved targets of microRNAs. Genome Res., 19, 92-105.
Friedman,R.C. 等人 (2009) 大多数哺乳动物 mRNA 是 microRNA 的保守靶标。基因组研究》,19,92-105。

Garcia,D.M. et al. (2011) Weak seed-pairing stability and high target-site abundance decrease the proficiency of lsy-6 and other microRNAs. Nat. Struct. Mol. Biol., 18, 1139-1146.
Garcia,D.M.等(2011)弱种子配对稳定性和高靶位丰度降低了 lsy-6 和其他 microRNA 的能力。Nat.Struct.Mol.Biol.,18,1139-1146。

Grimson,A. et al. (2007) MicroRNA targeting specificity in mammals: determinants beyond seed pairing. Mol. Cell, 27, 91-105.
Grimson,A. et al. (2007) MicroRNA 在哺乳动物中的靶向特异性:种子配对之外的决定因素。Mol. Cell, 27, 91-105.细胞,27,91-105。
Hsu,S.-D. et al. (2011) miRTarBase: a database curates experimentally validated microRNA-target interactions. Nucleic Acids Res., 39, D163-D169.
Hsu,S.-D. et al. (2011) miRTarBase: a database curates experally validated microRNA-target interactions.核酸研究》,39,D163-D169。

Huang,J.C. et al. (2007) Bayesian inference of microRNA targets from sequence and expression data. J. Comput. Biol., 14, 550-563.
Huang,J.C. et al. (2007) Bayesian inference of microRNA targets from sequence and expression data.J. Comput.Biol., 14, 550-563.

Kertesz,M. et al. (2007) The role of site accessibility in microRNA target recognition. Nat. Genet., 39, 1278-1284.
Kertesz,M. 等人(2007 年),位点可及性在 microRNA 目标识别中的作用。Nat.Genet.,39,1278-1284。

Khan,A.A. et al. (2009) Transfection of small RNAs globally perturbs gene regulation by endogenous microRNAs. Nat. Biotechnol., 27, 549-555.
Khan,A.A. 等人(2009 年)转染小 RNA 会全面扰乱内源性 microRNA 对基因的调控。Nat.生物技术》,27,549-555。

Krek,A. et al. (2005) Combinatorial microRNA target predictions. Nat. Genet., 37, 495-500.
Krek,A. et al. (2005) Combinatorial microRNA target predictions.Nat.Genet.,37,495-500。

Larsson,E. et al. (2010) mRNA turnover rate limits siRNA and microRNA efficacy. Mol. Syst. Biol., 6, 1-9.
Larsson,E. 等人(2010)mRNA 的周转率限制了 siRNA 和 microRNA 的功效。Mol.Syst.Biol., 6, 1-9.

Le,H.S. and Bar-Joseph,Z. (2011) Inferring interaction networks using the IBP applied to microRNA target prediction. In: Shawe-Taylor,J. et al. (eds) Advances in Neural Information Processing Systems. Vol. 24, Sierra Nevada, Spain, pp. 235-243.
Le,H.S. and Bar-Joseph,Z.(2011) Inferring interaction networks using the IBP applied to microRNA target prediction.In:Shawe-Taylor,J. et al. (eds) Advances in Neural Information Processing Systems.24, Sierra Nevada, Spain, pp.

Lewis,B.P. et al. (2003) Prediction of mammalian microRNA targets. Cell, 115, 787-798.
Lewis,B.P. et al. (2003) Prediction of mammalian microRNA targets.细胞》,115,787-798。

Lim,L.P. et al. (2005) Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature, 433, 769-773.
Lim,L.P. 等人 (2005) 微阵列分析表明,一些 microRNA 可下调大量靶 mRNA。自然》,433,769-773。

Liu,H. et al. (2010a) A Bayesian approach for identifying miRNA targets by combining sequence prediction and gene expression profiling. BMC Genomics, 11 (Suppl. 3), S12.
Liu,H. et al. (2010a) A Bayesian approach for identifying miRNA targets by combining sequence prediction and gene expression profiling.BMC Genomics, 11 (Suppl. 3), S12.

Liu,H. et al. (2010b) Improving performance of mammalian microRNA target prediction. BMC Bioinformatics, 11, 476.
Liu,H. et al. (2010b) Improving performance of mammalian microRNA target prediction.BMC Bioinformatics, 11, 476.

Lorenz,R. et al. (2011) ViennaRNA Package 2.0. Algorithms Mol. Biol., 6, 26.
Lorenz,R. et al. (2011) ViennaRNA Package 2.0.Algorithms Mol.Biol., 6, 26.

Rumelhart,D. and Hintont,G. (1986) Learning representations by back-propagating errors. Nature, 323, 533-536.
Rumelhart,D. and Hintont,G. (1986) Learning representations by back-propagating errors.自然》,323,533-536。

Selbach,M. et al. (2008) Widespread changes in protein synthesis induced by microRNAs. Nature, 455, 58-63.
Selbach,M. et al. (2008) Widespread changes in protein synthesis induced by microRNAs.自然》,455,58-63。

Shannon,P. et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res., 13, 2498-2504.
Shannon,P. et al. (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks.Genome Res., 13, 2498-2504.

Sing,T. et al. (2005) ROCR: visualizing classifier performance in R. Bioinformatics, 21, 3940-3941.
Spizzo,R. et al. (2009) SnapShot: microRNAs in cancer. Cell, 137, 586-586.el.
Spizzo,R. et al. (2009) SnapShot: microRNAs in cancer.细胞》,137,586-586.El。

Sumazin,P. et al. (2011) An extensive microRNA-mediated network of RNA-RNA interactions regulates established oncogenic pathways in glioblastoma. Cell, 147, 370-381.
Sumazin,P. et al. (2011) 一个广泛的 microRNA 介导的 RNA-RNA 相互作用网络调控着胶质母细胞瘤的既定致癌途径。细胞》,147,370-381。

Troyanskaya,O. et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520-525.
Troyanskaya,O. 等人(2001 年)DNA 微阵列的缺失值估计方法。Bioinformatics, 17, 520-525.

van Dongen,S. et al. (2008) Detecting microRNA binding and siRNA off-target effects from expression data. Nat. Methods, 5, 1023-1025.
van Dongen,S. 等人(2008 年)从表达数据中检测 microRNA 结合和 siRNA 脱靶效应。Nat.Methods,5,1023-1025。

Vergoulis, T. et al. (2012) TarBase 6.0: capturing the exponential growth of miRNA targets with experimental support. Nucleic Acids Res., 40, D222-D229.
Vergoulis, T. et al. (2012) TarBase 6.0:利用实验支持捕捉 miRNA 靶点的指数增长。Nucleic Acids Res., 40, D222-D229.

Xiao,F. et al. (2009) miRecords: an integrated resource for microRNA-target interactions. Nucleic Acids Res., 37, D105-D110.
Xiao,F. et al. (2009) miRecords: an integrated resource for microRNA-target interactions.核酸研究》,37,D105-D110。

  1. *To whom correspondence should be addressed.
    * 通信收件人:
  2. Note: Check marks indicate the type of information used by each method.
    注:复选标记表示每种方法使用的信息类型。

    a a ^("a"){ }^{\text {a}} Context score (CS) is a sequence-based score for individual target sites calculated by targetScan (Grimson et al., 2007; Garcia et al., 2011).
    a a ^("a"){ }^{\text {a}} 上下文得分(Context score,CS)是 targetScan(Grimson 等人,2007 年;Garcia 等人,2011 年)为单个目标位点计算的基于序列的得分。

    b P C T b P C T ^(b)P_(CT){ }^{\mathrm{b}} P_{C T} is the probability of conserved targeting for individual target sites and also available from targetScan Web site (Friedman et al., 2009);
    b P C T b P C T ^(b)P_(CT){ }^{\mathrm{b}} P_{C T} 是单个靶点的保守靶向概率,也可从 targetScan 网站获得(Friedman 等人,2009 年);

    c logFC c logFC ^(c)logFC{ }^{c} \operatorname{logFC} : logarithmic fold-change due to miRNA transfection.
    c logFC c logFC ^(c)logFC{ }^{c} \operatorname{logFC} :miRNA 转染引起的对数折叠变化。