Evolutionary Relationship Inference using a CNN-based Approach
基于CNN的进化关系推理
The ERICA pipeline consists of 3 steps:
埃里卡管道包括3个步骤:
- Generating custom sequence files
生成自定义序列文件 - Evaluating evolutionary relationships
评估进化关系 - Finding discordant patterns and signatures of genome-wide and local introgression
发现全基因组和局部基因渗入的不一致模式和特征
ERICA uses a multiple sequence alignment (MSA) as input to evaluate evolutionary relationships. A DNA sequence alignment without any other annotation should be provided. Adjacent eight lines will be recognized as belonging to one taxon. Thus, MSAs containing 32 and 40 lines are required for the four-taxon and five-taxon analyses, respectively.
埃里卡使用多序列比对(MSA)作为输入来评估进化关系。应提供没有任何其他注释的DNA序列比对。相邻的8个品系将被认为属于一个分类单元,因此,包含32个和40个品系的MSA分别需要用于四个分类单元和五个分类单元的分析。
A demo of 8 sequence alignment:
一个8序列比对的演示:
TGGATTCGTCCGCCCAGGCACATCACAGAGCAATCAGACCTCGCAAACTA
TGGTTTCGTCCGCTCAGGCACATCACTGAGCAATCAGACCTCGCAAACTA
TGGTTTCGTCCGCCCAGGCACATCACAGAGCAATCAGACCTCGCNTACTA
TGGNTTCGTCCGCCCAGGCACATCACTGAGCAATCAGACCTCGCAAACTA
TGGTTTCGTCCGCTCAGGCACATCACTGAGCAATCAGACCTCGCAAACTA
TGGTTTCGTCCGCCCAGGCACATCACTGAGCAATCAGACCTCCCAAACTA
TGGATTCGTCCGCCCAGGCACNTCACTGAGCAATCAGACCTCGCATACTA
TGGATTCGTCCGCCCAGGCACATCACTGAGCAATCAGACCTCGCAAACTA
Users can use algorithms such as Clustal W or LASTZ for whole genome alignment. For a population-level study, genotype data storing in the format of Variant Call Format (VCF) can be used. Users can use vcf2MSA.py
to convert custom VCF files to MSA format.
用户可以使用Clustal W或LASTZ等算法进行全基因组比对。对于群体水平的研究,可以使用以变体调用格式(VCF)的格式存储的基因型数据。用户可以使用 vcf2MSA.py
将自定义VCF文件转换为MSA格式。
An example of VCF:
VCF的例子:
##fileformat=VCFv4.2
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT H_m_amaryllis_1 H_m_amaryllis_2 H_ethilla_1 H_ethilla_2 H_ethilla_3 H_ethilla_4 H_m_amaryllis_3 H_m_amaryllis_4 H_t_thelxinoe_1 H_t_thelxinoe_2 H_t_thelxinoe_3 H_t_thelxinoe_4 H_m_aglaope_1 H_m_aglaope_2 H_m_aglaope_3 H_m_aglaope_4
Hmel218003o 7295 . A T 138.17 . . GT:AD:DP:GQ:PL 0/0:14,0:14:36:0,36,491 0/0:10,0:10:24:0,24,332 0/0:24,0:24:63:0,63,827 0/1:6,6:12:99:150,0,194 0/0:13,0:13:39:0,39,512 0/0:26,0:26:66:0,66,880 0/0:20,0:20:60:0,60,742 0/0:33,0:33:99:0,99,1210 0/0:34,0:34:90:0,90,1198 0/0:23,0:23:69:0,69,912 0/0:25,0:25:69:0,69,953 0/0:19,0:19:54:0,54,752 ./.:.:.:.:. ./.:.:.:.:. 0/0:10,0:11:3:0,3,28 ./.:.:.:.:.
Hmel218003o 7315 . G C 27719.2 . . GT:AD:DP:GQ:PL 0/0:19,1:20:12:0,12,639 0/1:7,15:22:99:408,0,230 ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. ./.:.:.:.:. 0/1:15,10:25:99:251,0,442 0/0:27,0:27:78:0,78,944 1/1:1,29:30:84:1113,84,0 1/1:0,21:22:60:808,60,0 1/1:0,25:25:66:901,66,0 1/1:0,24:24:66:907,66,0 ./.:.:.:.:. 1/1:6,6:13:12:137,12,0 1/1:7,4:12:9:80,9,0 ./.:.:.:.:.
Hmel218003o 7319 . T A 1723.26 . . GT:AD:DP:GQ:PL 0/0:21,1:22:18:0,18,711 0/1:7,15:22:99:408,0,236 0/0:23,0:23:69:0,69,853 0/0:13,0:13:36:0,36,438 0/0:14,0:14:36:0,36,480 0/0:29,0:29:84:0,84,1058 0/1:15,10:25:99:263,0,443 0/0:29,0:29:84:0,84,1024 0/0:31,0:31:84:0,84,1139 0/0:23,0:23:69:0,69,930 0/0:26,0:26:69:0,69,945 0/0:24,0:24:66:0,66,910 ./.:.:.:.:. 0/1:7,7:15:10:156,0,10 1/1:9,4:13:9:80,9,0 ./.:.:.:.:.
Hmel218003o 7335 . C G 40227.2 . . GT:AD:DP:GQ:PL 1/1:1,16:17:9:580,9,0 0/1:16,8:24:99:264,0,410 1/1:0,20:20:60:742,60,0 1/1:0,13:13:39:456,39,0 1/1:0,12:12:30:393,30,0 1/1:0,28:28:81:1017,81,0 0/1:13,10:23:99:255,0,376 1/1:0,29:29:84:1016,84,0 1/1:0,30:30:87:1123,87,0 1/1:0,23:23:60:790,60,0 1/1:0,25:25:69:913,69,0 1/1:0,23:23:66:878,66,0 ./.:.:.:.:. 0/1:7,9:16:10:10,0,156 0/0:7,10:17:18:0,18,174 ./.:.:.:.:.
Hmel218003o 7342 . C T 7335.32 . . GT:AD:DP:GQ:PL 1/1:1,15:16:6:540,6,0 0/1:17,8:25:99:269,0,449 0/0:23,0:23:69:0,69,849 0/0:12,0:12:36:0,36,417 0/0:12,0:12:30:0,30,385 0/0:26,0:26:78:0,78,981 0/1:15,11:26:99:278,0,384 1/1:0,28:28:81:976,81,0 0/0:30,0:30:81:0,81,1086 0/0:23,0:24:63:0,63,804 0/0:25,0:25:66:0,66,856 0/0:25,0:25:69:0,69,908 ./.:.:.:.:. 0/1:7,9:16:10:10,0,160 0/0:10,8:18:24:0,24,253 ./.:.:.:.:.
For each MSA, we focus on the topology of its phylogenetic tree. Taking a four-taxon phylogeny for example, which contains three ingroup taxa and one outgroup taxon, there are three possible topological structures:
对于每个MSA,我们专注于其系统发育树的拓扑结构。以一个包含三个内群分类单元和一个外群分类单元的四分类单元系统为例,存在三种可能的拓扑结构:
/-P1 /-P1 /-P2
/-| /-| /-|
/-| \-P2 /-| \-P3 /-| \-P3
| | | | | |
--| \-P3 --| \-P2 --| \-P1
| | |
\-O \-O \-O
Topo A Topo B Topo C
(1,0,0) (0,1,0) (0,0,1)
However, one strict bifurcated tree cannot represent the real relationship of the ingroup taxa, especially when a focal taxon is not a monophyletic group or a focal region has multiple evolutionary histories. Thus, we generated training datasets for the CNN models by quantifying the proportion of each possible topology for MSA segments representing multiple evolutionary scenarios and encoded each MSA segment with a three-dimensional vector satisfying sum-to-one constraints.
然而,一个严格的分叉树不能代表内群分类单元的真实的关系,特别是当一个焦点分类单元不是一个单系群或一个焦点区域有多个进化历史。因此,我们通过量化代表多个进化场景的MSA片段的每个可能拓扑的比例来生成CNN模型的训练数据集,并使用满足和对一约束的三维向量对每个MSA片段进行编码。
Similarly, there are fifteen possible topologies for a five-taxon case, with twelve asymmetric and three symmetric topologies.
同样,对于五个分类单元的情况,有十五种可能的拓扑结构,其中十二种非对称拓扑结构和三种对称拓扑结构。
Class 类 | Topology 拓扑 |
---|---|
A | ((((P1, P2), P3), P4), O) (P1,P2),P3),P4),O) |
B | ((((P1, P3), P2), P4), O) (P1,P3),P2),P4),O) |
C | ((((P2, P3), P1), P4), O) (((((P2,P3),P1),P4),O) |
D | ((((P2, P3), P4), P1), O) (P2,P3),P4),P1),O) |
E | ((((P2, P4), P3), P1), O) (P2,P4),P3),P1),O) |
F | ((((P3, P4), P2), P1), O) (P3,P4),P2),P1),O) |
G | ((((P1, P3), P4), P2), O) (P1,P3),P4),P2),O) |
H | ((((P1, P4), P3), P2), O) (P1,P4),P3),P2),O) |
I | ((((P3, P4), P1), P2), O) (((((P3,P4),P1),P2),O) |
J | ((((P1, P2), P4), P3), O) (P1,P2),P4),P3),O) |
K | ((((P1, P4), P2), P3), O) (P1,P4),P2),P3),O) |
L | ((((P2, P4), P1), P3), O) (P2,P4),P1),P3),O) |
M | (((P1, P2), (P3, P4)), O) (((P1,P2),(P3,P4)),O) |
N | (((P1, P3), (P2, P4)), O) (((P1,P3),(P2,P4)),O) |
O | (((P1, P4), (P2, P3)), O) (((P1,P4),(P2,P3)),O) |
We have two trained CNN models covering most of the evolutionary scenarios. Scripts ERICAPrediction.py
can be used to predict topology probabilities for genomic data with a step window size of 5 kb.
我们有两个经过训练的CNN模型,涵盖了大多数进化场景。可以使用步长窗口大小为5kb的序列号0#来预测基因组数据的拓扑概率。
According to the relationships inferred by the CNN models, we can identify the introgressed regions via discordance between gene trees and species trees. The gene flow between non-sister species can change the topological structures, and the new topology depends on the species tree and direction of gene flow.
根据CNN模型推断的关系,我们可以通过基因树和物种树之间的不一致性来识别渐渗区域。非姐妹物种间的基因流会改变物种的拓扑结构,新的拓扑结构取决于物种树和基因流的方向。
However, both introgression and incomplete lineage sorting (ILS) can lead to discordant patterns. To distinguish signatures of introgression from ILS, we evaluated the theoretical distributions of topological discordance caused by ILS on the simulated data (see our paper for more details). We chose 0.4 as a threshold for 50 kb windows to make a false-positive rate (FPR) less than 5%.
然而,基因渗入和不完全谱系分选(ILS)都可能导致不一致的模式。为了区分渐渗与ILS的特征,我们评估了ILS对模拟数据引起的拓扑不一致的理论分布(更多细节参见我们的论文)。我们选择0.4作为50 kb窗口的阈值,以使假阳性率(FPR)小于5%。
The scripts ERICAVisualization.py
can be used to filter results by default or according to a user-defined threshold and plot topology proportions across the interesting regions.
脚本 ERICAVisualization.py
可用于在默认情况下或根据用户定义的阈值过滤结果,并绘制感兴趣区域的拓扑比例。
git clone https://github.com/YuboZhangPKU/ERICA.git
cd ERICA
Running ERICA pipeline requires only a standard computer with enough RAM, which is related to data size. The minimal requirement is about 3.3 GB of RAM.
运行埃里卡流水线只需要一台标准的计算机,有足够的RAM,这与数据大小有关。最低要求是大约3.3 GB的RAM。
Both CPUs and GPUs could be used, but GPUs significantly speed up the prediction step when compared to CPUs.
CPU和GPU都可以使用,但与CPU相比,GPU显著加快了预测步骤。
The following table shows the time and memory cost in prediction step for MSAs of different length:
下表显示了不同长度MSA的预测步骤中的时间和内存成本:
Model | 100 kb | 1 Mb 重试 错误原因 | 10 Mb 重试 错误原因 | 100 Mb 重试 错误原因 |
---|---|---|---|---|
Four taxon model (GPU) 四分类单元模型 |
0:29(3.3 Gb) 0:29(3.3 Gb) | 0:26(3.4 Gb) 0:26(3.4 Gb) | 1:37(4.5 Gb) 1:37(4.5 Gb) | 14:01(31.2 Gb) 14:01(31.2 Gb) |
Five taxon model (GPU) 五分类单元模型(GPU) |
0:45(3.3 Gb) 0:45(3.3 Gb) | 0:53(3.5 Gb) 0:53(3.5 Gb) | 2:37(5.3 Gb) 2:37(5.3 Gb) | 20:40(38.7 Gb) 20:40(38.7 Gb) |
Four taxon model (CPU) 四分类单元模型 |
2:25 | 17:43 | 165:57 | 1077:21 |
Five taxon model (CPU) 五个分类单元模型(CPU) |
1:55 | 9:36 | 86:46 | 857:31 |
(Times are minutes: seconds. Ran on one NVIDIA Tesla V100 SXM2 32GB GPU or two Intel Xeon E5-2680v3 CPUs using 20 threads)
(时间是分钟:秒。在一个NVIDIA Tesla V100 SXM 2 32 GB GPU或两个Intel Xeon E5- 2680 v3 CPU上运行(使用20个线程)
The pipeline has been tested on the following systems:
管道已在以下系统上进行了测试:
Linux: CentOS 7.5 and Gentoo 4.14
Linux:CentOS 7.5和Gentoo 4.14
ERICA pipeline requires python
version 3.6 or higher, and packages tensorflow
and plotnine
. For using NVIDIA GPU, CUDA and cuDNN are also required.
埃里卡pipeline需要 python
版本3.6或更高版本,以及软件包 tensorflow
和 plotnine
。使用NVIDIA GPU,还需要CUDA和cuDNN。
The following commands can create an environment using Anaconda:
以下命令可以使用Anaconda创建环境:
conda create --name ERICA python=3.6 tensorflow=2.1.0 plotnine=0.6.0
or for GPU version:
或者对于GPU版本:
conda create --name ERICA python=3.6 tensorflow-gpu==2.1.0 cudatoolkit cudnn plotnine=0.6.0
source activate ERICA
./test.sh
The script vcf2MSA.py
generates alignments of individual re-sequencing data from single nucleotide variations (SNVs) recorded in a VCF file and the corresponding reference sequences in the fasta format. Small insertions and deletions (INDELs) should be removed in advance to avoid sequence length variations.
脚本 vcf2MSA.py
根据VCF文件中记录的单核苷酸变异(SNV)和fasta格式的相应参考序列生成单个重测序数据的比对。应提前删除小插入和缺失(INDEL),以避免序列长度变化。
For different situations and hypotheses, four or five populations can be included, with the arguments--pop4
or --pop5
appointing the outgroup, respectively.
对于不同的情况和假设,可以包括四个或五个群体,参数 --pop4
或 --pop5
分别指定外群。
The program is designed for the diploid genome and insensitive to the order of alleles. Both phased and unphased genotype data can be used. If the --format
argument followed by the diplo
parameter, a pair of sequences will be generated with one sequence recording the allele in the first column, and another recording the second allele. Instead, if the parameter is haplo
, only one allele will be randomly retained and written into a sequence. Therefore, you can choose up to 4 or 8 individuals for each population, respectively. Note that it's not necessary to provide the maximum number of samples, the program can work with at least one individual per population, by randomly duplicating sequences of given individuals.
该程序是为二倍体基因组设计的,对等位基因的顺序不敏感。可以使用定相和非定相基因型数据。如果 --format
参数后面跟着 diplo
参数,则将生成一对序列,其中一个序列记录第一列中的等位基因,另一个记录第二列中的等位基因。相反,如果参数为 haplo
,则仅随机保留一个等位基因并将其写入序列中。因此,您可以为每个种群分别选择最多4个或8个个体。请注意,没有必要提供样本的最大数量,程序可以通过随机复制给定个体的序列来对每个群体中的至少一个个体进行操作。
The script will output an MSA file for each scaffold in the reference sequences. Users can use --include
or --exclude
arguments to specify the scaffolds for analyses.
vcf2MSA.py
has the following options:
- -h, --help Show the help message and exit.
- -i, --Input Path to an input VCF file. The compressed (gzipped) file is also supported.
- -o, --Output Output file name.
- -r, --Reference Path to reference fasta file, index (.fai) generated by samtools in the same directory is also required.
- -P1, --pop1 Sample names (separated by commas).
- -P2, --pop2 Sample names.
- -P3, --pop3 Sample names.
- -P4, --pop4 Sample names.
- -P5, --pop5 Sample names, optional.
- -f, --format Format of alleles to output, haplo or diplo.
- --exclude Name of scaffolds to exclude (separated by commas), optional.
- --include Name of scaffolds to include (separated by commas), optional.
An example is provided in test and the result in test_result.
The command lines for using vcf2MSA.py
are:
# four-taxon model
python vcf2MSA.py \
-i test/pop_test.vcf.gz \
-r test/pop_test.fasta \
-o test/pop_test \
-f diplo \
-P1 H_m_aglaope_1,H_m_aglaope_2,H_m_aglaope_3,H_m_aglaope_4 \
-P2 H_m_amaryllis_1,H_m_amaryllis_2,H_m_amaryllis_3,H_m_amaryllis_4 \
-P3 H_t_thelxinoe_1,H_t_thelxinoe_2,H_t_thelxinoe_3,H_t_thelxinoe_4 \
-P4 H_ethilla_1,H_ethilla_2,H_ethilla_3,H_ethilla_4
# five-taxon model
python vcf2MSA.py \
-i test/five_pop_test.vcf.gz \
-r test/five_pop_test.fasta \
-o test/five_pop_test \
-f haplo \
-P1 GP39,GP77,GP536,GP640,GP761-1 \
-P2 Nipponbare,HP14,HP44,HP48,HP314,HP103,HP45,UR28 \
-P3 W1943,W3095-2,Orufi,W3078-2 \
-P4 W0170,W1698,W1754,W0123-1,Oniva \
-P5 Obart
As mentioned above, the script ERICAPrediction.py
use 32 or 40 sequence alignments as input, and the --Input
argument specifies the path of input MSA files. If the path is a directory, all of the files that existed in this directory will be analyzed. The parameter after the argument --Tasks
specifies the number of jobs running in parallel.
The output of vcf2MSA.py
can be used as input in this step directly.
The program requires the model multiprocessing
and tensorflow
.
ERICAPrediction.py
has the following options:
- -h, --help Show the help message and exit.
- -i, --Input Path to input MSA file or directory.
- -o, --Output Output file name.
- -p, --Population The number of populations, 4 or 5.
- -t, --Tasks The number of tasks running in parallel.
- -m, --Model Name of trained CNN models used for prediction.
The command lines for analyzing the example data are:
# four-taxon model
python ERICAPrediction.py -i test/pop_test_Hmel218003o.txt -o test/pop_test_Hmel218003o_res.txt -p 4
# five-taxon model
python ERICAPrediction.py -i test/five_pop_test_10.txt -o test/five_pop_test_10_res.txt -p 5
The output file contains three or fifteen columns as the probabilities of each topology, and each line shows a result of a non-overlapped 5 kb window. An example results of four-taxon model:
0.2865 0.2967 0.4167
0.5867 0.2538 0.1595
0.7229 0.1028 0.1743
0.5597 0.2222 0.2181
0.5604 0.2345 0.2051
The script ERICAVisualization.py
uses topology proportions as input and plot topologies along a chromosome or given regions. The topology with the highest proportion or greater than a predefined threshold will be recorded, to suggest the putative loci of introgression.
The relationships between colors in output plots and topology follow:
The program requires model plotnine
and pandas
.
The program has the following options:
- -h, --help Show the help message and exit.
- -i, --Input Path to the input result file.
- -o, --Output Output prefix.
- -p, --Population Number of populations, 4 or 5.
- -w, --WindowSize Window size for outputting and plotting. Default = '50000'. The parameter should be divisible by 5000.
- -r, --Region 1-based indexs of focal regions, split by ':'. Default = '1:-1' (the full region).
- -d, --DistanceToZero Record topologies having proportion greater than the given threshold. Default = '0.40'. '?' indicates there is no topology passing through the threshold, which means a complex history.
- -m, --MaxValue Plot the highest supporting topologies along the chromosome. Default = 'True'.
- -l, --Line Line charts for each topology along the chromosome. Default = 'True'.
- -a, --Area Area charts for each topology along the chromosome. Default = 'True'.
- -c, --Chr Name for the chromosome, optional.
The following commands can be used to calculate the mean value for each 10 kb window, and to visualize the result.
# four-taxon model
python ERICAVisualization.py \
-i test/pop_test_Hmel218003o_res.txt \
-o test/pop_test_Hmel218003o_10k \
-p 4 \
-w 10000 \
-c Hmel218003o \
-r 1:200 \
-d 0.5
# five-taxon model
python ERICAVisualization.py \
-i test/five_pop_test_10_res.txt \
-o test/five_pop_test_10_10k \
-p 5 \
-w 10000 \
-c Chr10 \
-r 1:200 \
-d 0.5
An example of output CSV file pop_test_Hmel218003o_10k.csv
records the topology proportions and category information.
Chr | Index | A | B | C | CminusB | Max Value | Max Class | Distance Class |
---|---|---|---|---|---|---|---|---|
Hmel218003o | 1 | 0.4366 | 0.2752 | 0.2881 | 0.0129 | 0.4366 | A | ? |
Hmel218003o | 10001 | 0.6413 | 0.1625 | 0.1962 | 0.0337 | 0.6413 | A | A |
Hmel218003o | 20001 | 0.7142 | 0.1474 | 0.1383 | -0.0092 | 0.7142 | A | A |
Hmel218003o | 30001 | 0.6077 | 0.1448 | 0.2475 | 0.1027 | 0.6077 | A | A |
Custom data could be used to train the CNNs to further improve accuracy. The script ERICAModelTraining.py
uses MSA files and data labels with the same format as described above as input to train the models from scratch or fine-tune the pre-trained models.
ERICAModelTraining.py
has the following options:
- -h, --help Show the help message and exit.
- -i, --Input Path to input multiple sequence alignment (MSA) file or file list, split by comma.
- -l, --Label Path to input label file or file list, split by comma.
- -o, --Output Output file name.
- -p, --Population The number of populations, 4 or 5.
- -m, --Model Name of trained CNN models used for fine-tuning.
- -e, --Epochs The number of epochs used for model training. Default = '3'.
- -b, --Batch Batch size. Default = '12'.
- -r, --Rate Learning rate. Default = '0.0001'.
- --Iteration Number of iterations. Training process will stop after n iterations if the parameter is specified.
The command lines for training models using demo data are:
# training from scratch
# four-taxon
python ERICAModelTraining.py -i test/four_pop_test_msa.txt -l test/four_pop_test.label -p 4 -o four_pop_test_model -b 8 --Iteration 10
# five-taxon
python ERICAModelTraining.py -i test/five_pop_test_msa.txt -l test/five_pop_test.label -p 5 -o five_pop_test_model -b 8 --Iteration 10
# fine-tuning the pre-trained model
# four-taxon
python ERICAModelTraining.py -i test/four_pop_test_msa.txt -l test/four_pop_test.label -p 4 -o four_pop_test_ft_model -m TrainedModels/four_taxon_model_319200 -b 8 --Iteration 10
# five-taxon
python ERICAModelTraining.py -i test/five_pop_test_msa.txt -l test/five_pop_test.label -p 5 -o five_pop_test_ft_model -m TrainedModels/five_taxon_model_660600 -b 8 --Iteration 10