Minimum redundancy feature selection
Minimum redundancy feature selection is an algorithm frequently used to identify characteristics of genes and phenotypes accurately and to narrow down which of them are relevant. It is usually described in its pairing with relevant feature selection as Minimum Redundancy Maximum Relevance (mRMR). The method was first proposed in 2003 by Hanchuan Peng and Chris Ding,[1] followed by a theoretical formulation based on mutual information, along with the first definition of multivariate mutual information, published in IEEE Transactions on Pattern Analysis and Machine Intelligence in 2005.[2]
Feature selection, one of the basic problems in pattern recognition and machine learning, identifies subsets of data that are relevant to the parameters used; this is normally called Maximum Relevance. These subsets often contain material that is relevant but redundant, and mRMR attempts to address this problem by removing the redundant subsets. mRMR has applications in many areas, such as cancer diagnosis and speech recognition.
Features can be selected in many different ways. One scheme is to select the features that correlate most strongly with the classification variable; this has been called maximum-relevance selection. Many heuristic algorithms can be used, such as sequential forward, backward, or floating selection.
Alternatively, features can be selected to be mutually far apart from each other while still having "high" correlation with the classification variable. This scheme, termed Minimum Redundancy Maximum Relevance (mRMR) selection, has been found to be more powerful than maximum-relevance selection.
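The contrast between the two schemes can be sketched in a few lines of Python. This is a minimal illustration, not the published algorithm: it assumes absolute Pearson correlation as both the relevance and the redundancy measure, combines them by simple subtraction, and uses a greedy forward search. The function names (`abs_corr`, `max_relevance`, `mrmr`) and the toy data are hypothetical.

```python
import numpy as np


def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])


def max_relevance(X, y, k):
    """Rank features by relevance to y alone and keep the top k."""
    relevance = [abs_corr(X[:, j], y) for j in range(X.shape[1])]
    return [int(j) for j in np.argsort(relevance)[::-1][:k]]


def mrmr(X, y, k):
    """Greedy forward selection scoring relevance minus mean redundancy."""
    n_features = X.shape[1]
    relevance = [abs_corr(X[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]  # start with the most relevant feature
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Redundancy: mean correlation with the features already chosen.
            redundancy = np.mean([abs_corr(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


# Toy data (an assumption for illustration): features 0 and 1 are
# near-duplicates, feature 2 carries complementary information about
# the target, and feature 3 is pure noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
y = a + 0.5 * b
X = np.column_stack([a, a + 0.01 * rng.normal(size=500), b, rng.normal(size=500)])

print(max_relevance(X, y, 2))
print(mrmr(X, y, 2))
```

On this toy data, maximum-relevance selection should keep both near-duplicate features, whereas the mRMR sketch should keep one of them and then prefer feature 2, because the second duplicate is penalized for its redundancy with the first.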
As a special case, the "correlation" can be replaced by the statistical dependency between variables, which mutual information can quantify. In this case, mRMR has been shown to be an approximation to maximizing the dependency between the joint distribution of the selected features and the classification variable.
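For reference, the mutual-information form of these criteria can be written out as below. This is a sketch in the notation commonly used for the 2005 formulation;[2] the symbols are introduced here for illustration and do not appear in the text above: S is the set of selected features, x_i is a single feature, c is the classification variable, and I is mutual information (shown for discrete variables).

```latex
\begin{align}
I(x; y) &= \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}
  && \text{(mutual information)} \\
D(S, c) &= \frac{1}{|S|}\sum_{x_i \in S} I(x_i; c)
  && \text{(relevance, to be maximized)} \\
R(S) &= \frac{1}{|S|^{2}}\sum_{x_i,\, x_j \in S} I(x_i; x_j)
  && \text{(redundancy, to be minimized)} \\
S^{*} &= \arg\max_{S}\,\bigl[\, D(S, c) - R(S) \,\bigr]
  && \text{(mRMR, difference form)} \\
S^{*}_{\text{dep}} &= \arg\max_{S}\, I\bigl(\{x_i : x_i \in S\};\, c\bigr)
  && \text{(maximum dependency)}
\end{align}
```

The last line is the maximum-dependency objective that mRMR approximates. It requires estimating a joint distribution over all selected features, whereas the relevance and redundancy terms need only pairwise estimates.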
Studies have tried different measures of redundancy and relevance. A 2010 study compared several such measures in the context of biomedical images.[3]
References
[1] Ding, C. and Peng, H., "Minimum Redundancy Feature Selection from Microarray Gene Expression Data," 2nd IEEE Computer Society Bioinformatics Conference (CSB 2003), 11–14 August 2003, Stanford, CA, USA, pp. 523–529.
[2] Peng, H.C., Long, F., and Ding, C., "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1226–1238, 2005.
[3] Auffarth, B., Lopez, M., and Cerquides, J., "Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images," Advances in Data Mining. Applications and Theoretical Aspects, pp. 248–262, Springer, 2010. http://www.csc.kth.se/~auffarth/publications/redrel.pdf
External links
- Penglab mRMR