
Pseudo amino acid composition

From Wikipedia, the free encyclopedia

In molecular biology, pseudo amino acid composition (PseAAC) is a method introduced by Kuo-Chen Chou to convert a protein sequence into a numerical vector for use with pattern recognition techniques, such as discriminating between classes of proteins based on their sequences (e.g. between membrane proteins, transmembrane proteins, cytosolic proteins, and other types).[1] The method represented an advance beyond using the amino acid composition (AAC) alone. Instead, the protein is characterized by a vector that incorporates not only the amino acid composition but also information from local features of the protein sequence.[2]

Due to the success and widespread application of the PseAAC method, it was extended to address sequence-order effects in nucleotide compositions, giving rise to an analogous method called PseKNC.[3]

Sequential and discrete models

Two kinds of models are usually used to represent protein samples: the sequential and the discrete (or non-sequential) models.[4] The most elementary sequential model uses the entire amino acid sequence, as expressed by:

$$\mathbf{P} = R_1 R_2 R_3 \cdots R_L \tag{1}$$

where P represents the amino acid sequence, $L$ is the number of amino acid residues, $R_1$ is the first residue of the protein P, $R_2$ is the second residue, and so forth.

The problem with this approach was that in some sequence-similarity-search-based tools, the query protein often lacked significant homology (or sequence similarity) with any other known protein in the database. To resolve this problem, discrete models for representing protein samples were proposed. The simplest discrete model uses the amino acid composition (AAC) to represent protein samples. Under the AAC model, the protein P of Eq.1 can also be expressed by

$$\mathbf{P} = \begin{bmatrix} f_1 & f_2 & \cdots & f_{20} \end{bmatrix}^{\mathbf{T}} \tag{2}$$

where $f_u$ $(u = 1, 2, \ldots, 20)$ are the normalized occurrence frequencies of the 20 native amino acids in P, and T is the transposing operator.[4]
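As an illustration of the AAC model of Eq.2, the following minimal Python sketch counts residues and normalizes by sequence length. The function name `amino_acid_composition` and the alphabetical ordering of the one-letter codes are assumptions made here for illustration; the source does not fix an ordering.

```python
from collections import Counter

# The 20 native amino acids in one-letter code; this ordering is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the 20 normalized occurrence frequencies f_1..f_20 of Eq.2."""
    seq = sequence.upper()
    counts = Counter(seq)
    unknown = set(counts) - set(AMINO_ACIDS)
    if not seq or unknown:
        raise ValueError(f"empty sequence or non-standard residues: {sorted(unknown)}")
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    freqs = amino_acid_composition("MKVLAAGIVG")
    print(dict(zip(AMINO_ACIDS, freqs)))
```

Two sequences with the same residues in a different order, such as `"ACDC"` and `"CCAD"`, map to the same vector, because the composition does not depend on residue order.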

Pseudo-Amino Acid Composition (PseAAC) model

The primary weakness of the discrete model that relies on the amino acid composition (AAC) is that the frequencies of the amino acids alone carry no sequence-order information, that is, the information contained in the order of the amino acid residues. To avoid this information loss, the concept of PseAAC (pseudo amino acid composition) was proposed.[1]

Under this new model, the first 20 discrete factors, which represent the amino acid frequencies, are retained, but additional discrete factors are included that capture information about sequence order. The sequence-order information is represented by what are called "pseudo components". The number of additional components, beyond the first 20 frequencies, is called λ (or upper-case Λ), and so 20+λ components are included in the model. The upper limit for λ is one less than the length of the shortest protein sample in the dataset.[1] The total number of components (20+λ) may be denoted Ω. Any additional factors can be incorporated so long as they, in some way, represent information about the sequence order. Typically, these are a series of rank-different correlation factors along the protein chain.[4]
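The bound on λ and the resulting dimension Ω can be checked with a short Python sketch. The helper names `max_lambda` and `pseaac_dimension` are hypothetical and chosen only for this illustration.

```python
def max_lambda(sequences: list[str]) -> int:
    """Largest usable λ: one less than the length of the shortest sequence."""
    if not sequences:
        raise ValueError("need at least one sequence")
    return min(len(s) for s in sequences) - 1


def pseaac_dimension(lam: int) -> int:
    """Total number of components Ω = 20 + λ."""
    if lam < 0:
        raise ValueError("λ must be non-negative")
    return 20 + lam


if __name__ == "__main__":
    dataset = ["MKVLA", "MKV", "MKVLAAGI"]  # toy sequences, not real proteins
    lam = max_lambda(dataset)
    print(lam, pseaac_dimension(lam))
```

With λ = 0 the model reduces to the plain 20-component amino acid composition.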

These additional factors need not be rank-different correlation factors: they can be any combination of other factors, so long as they reflect sequence-order effects in some way. The essence of PseAAC is therefore that it covers the amino acid composition while also containing information beyond it, and hence can better reflect the features of a protein sequence through a discrete model.

Meanwhile, various modes to formulate the PseAAC vector have also been developed, as summarized in a 2009 review article.[2]

Algorithm

Figure 1. A schematic drawing showing (a) the 1st-tier, (b) the 2nd-tier, and (c) the 3rd-tier sequence-order-correlation mode along a protein sequence, where $R_1$ represents the amino acid residue at sequence position 1, $R_2$ at position 2, and so forth (cf. Eq.1), and the coupling factors $J_{i,i+k}$ are given by Eq.6. Panel (a) reflects the correlation mode between all the most contiguous residues, panel (b) that between all the 2nd most contiguous residues, and panel (c) that between all the 3rd most contiguous residues.

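According to the PseAAC model, the protein P of Eq.1 can be formulated as

$$\mathbf{P} = \begin{bmatrix} p_1, & p_2, & \ldots, & p_{20}, & p_{20+1}, & \ldots, & p_{20+\lambda} \end{bmatrix}^{\mathbf{T}}, \quad (\lambda < L) \tag{3}$$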

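where the $(20+\lambda)$ components are given by

$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (1 \le u \le 20) \\[2ex] \dfrac{w \, \tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (20+1 \le u \le 20+\lambda) \end{cases} \tag{4}$$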

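where $w$ is the weight factor, and $\tau_k$ is the $k$-th tier correlation factor, which reflects the sequence-order correlation between all the $k$-th most contiguous residues, as formulated by

$$\tau_k = \frac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}, \quad (k < L) \tag{5}$$

with

$$J_{i,i+k} = \frac{1}{\Gamma} \sum_{q=1}^{\Gamma} \left[ \Phi_q(R_{i+k}) - \Phi_q(R_i) \right]^2 \tag{6}$$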

where $\Phi_q(R_i)$ is the $q$-th function of the amino acid $R_i$, and $\Gamma$ is the total number of functions considered. For example, in the original paper by Chou,[1] $\Phi_1(R_i)$, $\Phi_2(R_i)$ and $\Phi_3(R_i)$ are respectively the hydrophobicity value, hydrophilicity value, and side chain mass of amino acid $R_i$, while $\Phi_1(R_{i+1})$, $\Phi_2(R_{i+1})$ and $\Phi_3(R_{i+1})$ are the corresponding values for the amino acid $R_{i+1}$. Therefore, the total number of functions considered there is $\Gamma = 3$. It can be seen from Eq.3 that the first 20 components, i.e. $p_1, p_2, \ldots, p_{20}$, are associated with the conventional amino acid composition of the protein, while the remaining components $p_{20+1}, \ldots, p_{20+\lambda}$ are the correlation factors that reflect the 1st-tier, 2nd-tier, ..., and $\lambda$-th tier sequence-order correlation patterns (Figure 1). It is through these additional $\lambda$ factors that some important sequence-order effects are incorporated.
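The tier correlation factors of Eq.5 and the coupling factors of Eq.6 can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function names `coupling` and `tier_correlation` are hypothetical, each property function Φ_q is assumed to be supplied as a dictionary from residue to value, and the toy property table contains placeholder numbers rather than real physicochemical data.

```python
def coupling(r_i: str, r_j: str, props: list[dict[str, float]]) -> float:
    """Eq.6: mean squared difference of the Γ property values of two residues."""
    return sum((p[r_j] - p[r_i]) ** 2 for p in props) / len(props)


def tier_correlation(sequence: str, k: int, props: list[dict[str, float]]) -> float:
    """Eq.5: average coupling over all residue pairs that are k positions apart."""
    length = len(sequence)
    if not 1 <= k < length:
        raise ValueError("k must satisfy 1 <= k < L")
    total = sum(
        coupling(sequence[i], sequence[i + k], props) for i in range(length - k)
    )
    return total / (length - k)


if __name__ == "__main__":
    # Placeholder property table for a two-letter toy alphabet (not real data).
    toy_props = [{"A": 0.0, "C": 1.0}]
    print(tier_correlation("ACAC", 1, toy_props))  # all neighbouring pairs differ
    print(tier_correlation("ACAC", 2, toy_props))  # all pairs two apart are identical
```

In the toy sequence, every pair of adjacent residues differs while every pair two positions apart is identical, so the 1st-tier factor is positive and the 2nd-tier factor is zero. Any scaling or normalization of the property values is left to the caller.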

$\lambda$ in Eq.3 is an integer parameter, and choosing a different integer for $\lambda$ leads to a PseAAC of a different dimension.[5]
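Putting Eqs.3 to 6 together, a PseAAC vector could be assembled as in the minimal Python sketch below. The function name `pseaac`, the residue ordering, the representation of property functions as dictionaries, and the toy property values and weight are all assumptions made for illustration; none of them is taken from the original paper.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 native residues


def pseaac(sequence: str, props: list[dict[str, float]], lam: int, w: float) -> list[float]:
    """Return the (20 + λ)-component vector of Eq.3, built from Eqs.4-6."""
    length = len(sequence)
    if not 0 <= lam < length:
        raise ValueError("λ must satisfy 0 <= λ < L")
    counts = Counter(sequence)
    freqs = [counts[aa] / length for aa in AMINO_ACIDS]  # f_1..f_20

    taus = []
    for k in range(1, lam + 1):  # τ_1..τ_λ (Eq.5)
        j_sum = sum(
            sum((p[b] - p[a]) ** 2 for p in props) / len(props)  # J_{i,i+k} (Eq.6)
            for a, b in zip(sequence, sequence[k:])
        )
        taus.append(j_sum / (length - k))

    denom = sum(freqs) + w * sum(taus)  # shared denominator of Eq.4
    return [f / denom for f in freqs] + [w * t / denom for t in taus]


if __name__ == "__main__":
    toy_props = [{"A": 0.0, "C": 1.0}]  # placeholder values, not real data
    for lam in (0, 1, 2):
        vec = pseaac("ACAC", toy_props, lam, w=0.5)
        print(lam, len(vec))
```

The loop in the usage example shows the dimension growing as 20 + λ. Because all components share the denominator of Eq.4, they sum to 1 for any choice of λ and w.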

Using Eq.6 is just one of many modes for deriving the correlation factors in PseAAC or its components. Others, such as the physicochemical distance mode[6] and the amphiphilic pattern mode,[7] can also be used to derive different types of PseAAC, as summarized in a 2009 review article.[2] In 2011, the formulation of PseAAC (Eq.3) was extended to the general PseAAC, given by:[8]

$$\mathbf{P} = \begin{bmatrix} \psi_1 & \psi_2 & \cdots & \psi_u & \cdots & \psi_\Omega \end{bmatrix}^{\mathbf{T}} \tag{7}$$

where the subscript $\Omega$ is an integer, and its value and the components $\psi_u$ $(u = 1, 2, \ldots, \Omega)$ depend on how the desired information is extracted from the amino acid sequence of P in Eq.1.
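One way to read Eq.7 in code is as a concatenation of whatever feature blocks the researcher chooses, with Ω equal to the total length of the result. The Python sketch below takes that reading; `general_pseaac`, `length_feature` and `fraction_of` are hypothetical names introduced here, and the two toy extractors merely stand in for real feature sources such as functional domain or evolutionary information.

```python
from typing import Callable, Sequence

FeatureExtractor = Callable[[str], Sequence[float]]


def general_pseaac(sequence: str, extractors: Sequence[FeatureExtractor]) -> list[float]:
    """Concatenate the outputs of all extractors into one vector ψ_1..ψ_Ω."""
    vector: list[float] = []
    for extract in extractors:
        vector.extend(float(x) for x in extract(sequence))
    return vector


# Two hypothetical extractors, used only for illustration.
def length_feature(sequence: str) -> list[float]:
    """A one-component block holding the sequence length."""
    return [float(len(sequence))]


def fraction_of(residue: str) -> FeatureExtractor:
    """Build an extractor returning the fraction of one residue type."""
    return lambda sequence: [sequence.count(residue) / len(sequence)]


if __name__ == "__main__":
    vec = general_pseaac("AACD", [length_feature, fraction_of("A"), fraction_of("C")])
    print(len(vec), vec)  # Ω is the number of components produced
```

Under this reading, the classical PseAAC of Eq.3 is the special case in which the extractors return the 20 composition components and the λ correlation components.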

The general PseAAC can be used to reflect any desired features according to the targets of research, including core features such as functional domain, sequential evolution, and gene ontology, to improve the prediction quality for the subcellular localization of proteins[9][10] as well as many of their other important attributes.

References

  1. Chou KC (May 2001). "Prediction of protein cellular attributes using pseudo-amino acid composition". Proteins. 43 (3): 246–55. doi:10.1002/prot.1035. PMID 11288174. S2CID 28406797.
  2. Chou KC (2009). "Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology". Current Proteomics. 6 (4): 262–274. doi:10.2174/157016409789973707.
  3. Chen W, Lei TY, Jin DC, Lin H, Chou KC (2014). "PseKNC: A flexible web server for generating pseudo K-tuple nucleotide composition". Analytical Biochemistry. 456: 53–60. doi:10.1016/j.ab.2014.04.001.
  4. Chou KC (March 2011). "Some remarks on protein attribute prediction and pseudo amino acid composition". Journal of Theoretical Biology. 273 (1): 236–247. doi:10.1016/j.jtbi.2010.12.024. ISSN 0022-5193. PMC 7125570.
  5. Chou KC, Shen HB (November 2007). "Recent progress in protein subcellular location prediction". Analytical Biochemistry. 370 (1): 1–16. doi:10.1016/j.ab.2007.07.006. PMID 17698024.
  6. Chou KC (November 2000). "Prediction of protein subcellular locations by incorporating quasi-sequence-order effect". Biochemical and Biophysical Research Communications. 278 (2): 477–83. doi:10.1006/bbrc.2000.3815. PMID 11097861.
  7. Chou KC (January 2005). "Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes". Bioinformatics. 21 (1): 10–9. doi:10.1093/bioinformatics/bth466. PMID 15308540.
  8. Chou KC (March 2011). "Some remarks on protein attribute prediction and pseudo amino acid composition". Journal of Theoretical Biology. 273 (1): 236–47. Bibcode:2011JThBi.273..236C. doi:10.1016/j.jtbi.2010.12.024. PMC 7125570. PMID 21168420.
  9. Chou KC, Shen HB (2008). "Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms". Nature Protocols. 3 (2): 153–62. doi:10.1038/nprot.2007.494. PMID 18274516. S2CID 226104. Archived from the original on 2007-08-27. Retrieved 2008-03-24.
  10. Shen HB, Chou KC (February 2008). "PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition". Analytical Biochemistry. 373 (2): 386–8. doi:10.1016/j.ab.2007.10.012. PMID 17976365.