
Summary

Keywords

1. Introduction

1.1 Restatement of the problem

The Olympic Games are the largest and most influential sporting event in the world. They showcase the achievements of athletes from every nation and promote cultural exchange and integration. The Olympic medal table summarizes each country's competitive results and is often read as a reflection of its overall national strength. Although the International Olympic Committee does not recognize any ranking of countries by medals, the continuously updated medal table still strongly shapes media coverage and public interest [1]. Predicting the number and distribution of medals across countries at future Games more accurately has therefore attracted wide attention.

Given the dataset and constraints provided in the problem statement, we need to complete the following tasks for Olympic medal prediction:

Build a model that predicts medal counts, estimating the number of gold medals and total medals for each country, and assess the accuracy of the model.

Based on the established model:

Forecast the medal table of the 2028 Los Angeles Olympics and the likely changes in each country's performance.

For countries that have never won a medal, predict which are likely to win their first medal at the next Olympics.

Explore the relationship between sports and countries' medal counts, and analyze how the sports a country chooses to enter affect its final results.

Search the provided data for evidence of changes that may be attributable to the "great coach" effect, estimate its impact on medal counts, and produce estimates for selected countries and sports.

Explain the original insights about Olympic medal counts that the model reveals.

1.2 Literature review

Medal predictions matter to governments and citizens, who use them as a benchmark for judging whether their country's Olympic performance was a success. They also matter to many non-governmental stakeholders. Over the past few decades, a number of scholars have studied Olympic medal prediction.

Ball [2] pioneered Olympic prediction with a correlation-based scoring model. Prediction models have since grown more sophisticated. Initially, most researchers, such as Condon et al. [3], used ordinary least squares (OLS) regression because its results are easy to interpret. However, correctly predicting the large number of countries that win no medals is a difficult problem. To address it, some researchers adopted count models such as the Poisson and negative binomial models [4][5]. Bernard and Busse [6] compared several econometric methods and found that the Tobit model predicts medal counts better, which has led more researchers to use Tobit regression in recent years [7]. Other approaches, such as the Hurdle model [8], have also been studied and are not discussed further here.

Although prediction methods continue to be improved and refined, there is still room to raise prediction accuracy.

1.3 Our work

2. Assumptions and Justifications

The data given in the problem are real and credible.

If the provided data were not credible, the results derived from them would not be persuasive, so this assumption is necessary.

The influence of special circumstances, including unfair refereeing and doping by athletes, is ignored.

The special factors listed in this assumption, and others like them, occur by chance and are extremely difficult or even impossible to estimate. Ignoring them simplifies the calculation and analysis.

3. Notations

4. Data Pre-processing

In summerOly_medal_counts, to simplify later processing, we converted the NOC column to each country's letter code (for example, China becomes CHN). In the summerOly_programs file, some sports were missing their event counts, Sports Governing Body, or Code. We filled in the missing Sports Governing Body and Code values from online sources, and then computed each sport's missing event counts from the Total event row of the table. This completes the preprocessing of the provided data.

5. The Two-Stage Random Forest Model Based on Multidimensional Variables (TSRF-MV)

Many countries on the Olympic medal table have a medal count of 0. An ordinary regression model cannot capture this threshold behavior, which reduces prediction accuracy. To limit this effect we turned to other models. We first built three candidates: the Tobit model, the Hurdle model, and the two-stage random forest model based on multidimensional variables. After evaluating and comparing their prediction accuracy, we selected the two-stage random forest model based on multidimensional variables as our medal prediction model, and used it to predict gold medal counts and total medal counts separately.

5.1 Variable Determination

We use the following variables in the models below:

gold_medal/total_medal: the number of gold medals and the total number of medals to be predicted; these are the target variables.

join event/participant: the total number of events the country entered and its number of participants in that year.

host: whether the country was the host that year (1 if so, 0 otherwise).

last medal/gold count: the total number of medals and the number of gold medals the country won at the previous Olympic Games.

Sport: the type of sport entered.

host count: the number of Olympic Games the country has hosted historically (up to 2000).

history participant: the country's historical number of participants (up to 2000).

history medal: the total number of medals the country has won historically (up to 2000).

In addition, based on our review of the literature, we believe that a country's accumulated sporting history and culture should also be considered.

5.1.1 Quantifying sports historical and cultural accumulation

To quantify each country's accumulated sporting history and culture, we built a scoring model. We collected data for 222 countries, as shown in the table below (only part is displayed):

Here NOC is the country code, host count is the total number of Olympic Games the country has hosted, history participant is its historical number of participants, and history medal is the number of medals it has won historically.

We built an entropy weight scoring model on three indicators: a country's historical medal count, its historical number of participants, and the number of Olympic Games it has hosted. We first standardize the feature matrix of the 222 country samples. The dispersion of each feature determines how much information the corresponding indicator carries, which gives the weights of the three indicators. Multiplying the standardized matrix by these weights yields each country's score. We use min-max standardization to map all sample values of the three features to the interval [0,1]. We then compute the feature proportion matrix P as follows:

$$p_{ij} = \frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}}$$

where x_{ij} is the value of the j-th feature of the i-th sample and p_{ij} is the corresponding entry of P. The model has 3 features, so j runs from 1 to 3. Next, we compute the information entropy E_j of each feature:

$$E_j = -\frac{1}{\ln(n)} \sum_{i=1}^{n} p_{ij} \ln(p_{ij})$$

where n is the number of samples, that is, the number of countries in this model. We then convert E_j into a larger-is-better indicator d_j:

$$d_j = 1 - E_j$$

Finally, we normalize d_j to obtain the final weights:

$$w_j = \frac{d_j}{\sum_{j=1}^{m} d_j}$$

where m is the total number of features, which is 3 in this model. Multiplying the standardized data matrix by the corresponding weights gives each country's score. The scores of some countries are reported in the accompanying table.
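The entropy weight steps above (min-max standardization, proportion matrix, entropy, weights, score) can be sketched in a few lines. This is a minimal illustration rather than the code used for the paper: the function name `entropy_weight_scores` and the input layout (one list of raw indicator values per country) are assumptions, and terms with p_{ij} = 0 are treated as contributing 0 to the entropy.

```python
import math

def entropy_weight_scores(rows):
    """Entropy weight scoring (hypothetical helper).

    rows: one list of raw indicator values per country, all the same length.
    Needs at least two countries and at least one non-constant indicator.
    Returns (weights, scores).
    """
    n = len(rows)
    # Min-max standardise each indicator (column) to [0, 1].
    norm = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # guard against a constant column
        norm.append([(v - lo) / span for v in col])
    # Proportions p_ij, entropy E_j, and d_j = 1 - E_j for each indicator.
    d = []
    for col in norm:
        total = sum(col) or 1.0
        p = [v / total for v in col]
        e = -sum(pi * math.log(pi) for pi in p if pi > 0) / math.log(n)
        d.append(1.0 - e)
    weights = [dj / sum(d) for dj in d]
    # Score of country i = weighted sum of its standardised indicators.
    scores = [sum(w * norm[j][i] for j, w in enumerate(weights))
              for i in range(n)]
    return weights, scores
```

A country that holds the maximum of every indicator scores 1, and one that holds the minimum of every indicator scores 0, so the scores are directly comparable across countries.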

5.2 Establishment of the Tobit Model

The Tobit model is a regression model for censored or limited data. It is typically used when the dependent variable is censored or restricted, for example when a share of the observations equals 0. The model assumes that the observed dependent variable is generated from a latent continuous variable (usually unobserved) through a censoring process. We use a generalized linear model (GLM) to implement the censoring and model the target variable (medal count) with a Poisson distribution. In the GLM we choose a linear link function:

$$E(Y) = X\beta + \varepsilon$$

$$Y_{it} = \begin{cases} X_{it}\beta + \varepsilon, & X_{it}\beta + \varepsilon > 0 \\ 0, & \text{otherwise} \end{cases}$$

where Y_{it} is the number of medals won by the i-th country at the Olympic Games in year t, X_{it} is the feature vector of the i-th country in that year, β is the parameter vector of the model, and ε is the error term.

5.3 Establishment of the Hurdle Model

The Hurdle model is a regression model for zero-inflated data. It is particularly suitable when the dependent variable contains a large number of zeros. Its core idea is to split the distribution of the dependent variable into two parts: the probability that a zero occurs (whether an observation is zero), and the distribution of the non-zero values (how they are distributed once they occur).

For the first part, we use a logistic regression function:

$$P(\text{zero}) = \frac{1}{1 + \exp(-X\beta_0)}$$

where β_0 denotes the parameters of the zero-inflation part and X is the standardized sample matrix.

For the second part, we model the target variable (medal count) with a Poisson distribution:

$$p(y_{i,t} \mid \lambda) = \frac{\lambda^{y_{i,t}} \exp(-\lambda)}{y_{i,t}!}$$

$$\lambda = \exp(X\beta_1)$$

where y_{i,t} is the target variable (medal count), λ is the parameter of the Poisson distribution, and β_1 is the vector of regression coefficients of the Poisson part. The final model combines the zero-inflation part and the Poisson regression part:

$$p(y_{i,t}) = P(\text{zero}) \cdot \mathbb{1}(y_{i,t} = 0) + \bigl(1 - P(\text{zero})\bigr) \frac{\lambda^{y_{i,t}} \exp(-\lambda)}{y_{i,t}!}$$

where \mathbb{1}(\cdot) is the indicator function, equal to 1 when y_{i,t} = 0 and 0 otherwise. The parameters β_0 and β_1 are estimated by minimizing the negative log-likelihood, which completes the fitting of the model.
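To make the combined formula concrete, the sketch below evaluates it for a single sample. It is only an illustration under stated assumptions: the function names `p_zero` and `medal_pmf` are hypothetical, and the coefficient vectors `beta0` and `beta1` are taken as given instead of being fitted.

```python
import math

def p_zero(x, beta0):
    """Logistic probability of the zero part for feature vector x."""
    z = sum(xi * bi for xi, bi in zip(x, beta0))
    return 1.0 / (1.0 + math.exp(-z))

def medal_pmf(y, x, beta0, beta1):
    """Probability of observing y medals under the combined formula."""
    lam = math.exp(sum(xi * bi for xi, bi in zip(x, beta1)))
    poisson = lam ** y * math.exp(-lam) / math.factorial(y)
    pz = p_zero(x, beta0)
    indicator = 1 if y == 0 else 0
    return pz * indicator + (1.0 - pz) * poisson
```

Summing `medal_pmf` over all non-negative y gives 1, which is a quick sanity check that the two parts are combined consistently.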

5.4 Establishment of the two-stage random forest model

We carry the Hurdle model's idea of splitting the prediction into two parts over to machine learning models. We first train a classifier that separates countries predicted to win medals from those predicted not to. We then train a regressor that predicts the exact number of medals for the countries that win them.

Random forest is an ensemble learning method that combines many decision trees. Many studies have demonstrated its effectiveness for prediction in sports [9][10], so we use a random forest at both stages. All of the variables described above are used as the feature vector of each sample for both classification and regression. In the classifier, we set the number of decision trees to 10 and the maximum depth to 4; in the regressor, we set the number of decision trees to 1000 and the maximum depth to 5.
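The two-stage idea can be sketched with scikit-learn as follows. This is a hedged illustration, not the paper's implementation: the class name `TwoStageForest` and the `random_state` argument are assumptions, and only the tree counts and depths come from the text. The sketch assumes the training data contain at least one medal-winning sample, since the regressor is fitted on those samples only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

class TwoStageForest:
    """Stage 1 decides medal / no medal; stage 2 predicts the medal count."""

    def __init__(self, random_state=0):
        # Hyperparameters follow the values stated in the text.
        self.clf = RandomForestClassifier(
            n_estimators=10, max_depth=4, random_state=random_state)
        self.reg = RandomForestRegressor(
            n_estimators=1000, max_depth=5, random_state=random_state)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        won = y > 0
        self.clf.fit(X, won)
        # The regressor is trained on medal-winning samples only.
        self.reg.fit(X[won], y[won])
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        wins = self.clf.predict(X).astype(bool)
        out = np.zeros(len(X))
        if wins.any():
            out[wins] = self.reg.predict(X[wins])
        return out
```

Countries that the classifier labels as non-winners receive a prediction of exactly 0, which is what lets this design represent the large mass of zero-medal countries.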

We use a time-ordered form of cross-validation to guard against overfitting. The validation set is the most recent Olympic Games in the selected data range, and the training set is all earlier Games. This prevents validation samples from predating some of the training samples, for example a test set drawn from the 2008 Olympics combined with a training set that contains Games held after 2008.
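The time-ordered split described above can be sketched as follows. The function name `split_by_last_games` and the record layout (a dict with a `Year` key) are assumptions made for illustration.

```python
def split_by_last_games(samples):
    """Hold out the most recent Games as the validation set.

    samples: list of dicts, each with a 'Year' key (hypothetical layout).
    Returns (train, valid); every training year precedes the validation year.
    """
    last_year = max(s["Year"] for s in samples)
    train = [s for s in samples if s["Year"] < last_year]
    valid = [s for s in samples if s["Year"] == last_year]
    return train, valid
```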

5.5 Model Testing and Selection

Selection of training and test data: we used the data from 2004 to 2020 as the training data for all three models and the 2024 data as the test data. The input variables are host, join event, join participant, score (the accumulation score from Section 5.1.1), last medal count, and total medal. The first five are the feature variables and total medal is the target variable. We measure model performance with four metrics: MSE (mean squared error), MAE (mean absolute error), RMSE (root mean squared error), and R² (R-squared).

Tobit model: MSE=39.848; MAE=3.887; RMSE=6.312; R²=0.926

Hurdle model: MSE=26.818; MAE=2.970; RMSE=5.178; R²=0.950

TSRF-MV model: MSE=22.287; MAE=2.567; RMSE=5.124; R²=0.968

Looking at each metric in turn, all three models predict well. The figure below compares the MSE, R², MAE, and RMSE of the three models. The TSRF-MV model has the smallest MSE, MAE, and RMSE and the largest R², meaning that it has the smallest error and the best fit of the three. We therefore consider the TSRF-MV model the best performer and choose it as our medal prediction model.
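For reference, the four metrics used in the comparison can be computed as in the sketch below. The function name `regression_metrics` is an assumption; the definitions are the standard ones, with RMSE taken as the square root of MSE and R² as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSE and R-squared for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - (mse * n) / ss_tot  # undefined if y_true is constant
    return {"MSE": mse, "MAE": mae, "RMSE": math.sqrt(mse), "R2": r2}
```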

5.6 Final construction of the TSRF-MV model

The model comparison above did not include the Sport variable. There are so many sports that including them could cause the Tobit and Hurdle models to overfit, or could make estimation fail because the sample size cannot support that many features. We therefore left Sport out of the Tobit and Hurdle models and, for consistency, out of the TSRF-MV model as well. Having selected the TSRF-MV model, we believe that adding the Sport variable, and thus accounting for the sports a country enters, can further improve its performance.

Because Sport is a string variable, it cannot be fed to the model directly, so we convert it to 0/1 variables with one-hot encoding. Each distinct Sport label produces a new feature, such as Sport_Athletics or Sport_Judo. For example, if a sample country entered Athletics in a given year, its Sport_Athletics feature is set to 1 and the other Sport_... features are set to 0. This lets the model account for the effect of the sport type on total and gold medal counts.
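The encoding described above can be sketched without any library. The function name `one_hot_sport` and the record layout (a dict with a `Sport` key) are assumptions for illustration; the `Sport_<name>` column naming follows the text.

```python
def one_hot_sport(samples, sports):
    """Replace the 'Sport' string with Sport_<name> 0/1 columns.

    samples: list of dicts with a 'Sport' key (hypothetical layout).
    sports: the list of all sport labels to encode.
    Returns new dicts; the input records are not modified.
    """
    encoded = []
    for s in samples:
        row = {k: v for k, v in s.items() if k != "Sport"}
        for name in sports:
            row[f"Sport_{name}"] = 1 if s["Sport"] == name else 0
        encoded.append(row)
    return encoded
```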

In addition, a country's number of participants and number of events entered may be unknown for a future Games. We therefore built a grey prediction model, GM(1,1), to forecast these two quantities when they are not available.
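A minimal GM(1,1) sketch is given below to show how such a forecast can be produced from a short series. It follows the standard textbook construction (accumulate the series, fit the grey differential equation by least squares, then difference the fitted accumulated values) and is not necessarily the exact implementation used here; the function name `gm11_forecast` is an assumption. It needs at least three observations and fails for a perfectly constant series, where the development coefficient a is zero.

```python
import math

def gm11_forecast(x0, steps=1):
    """Fit GM(1,1) to the series x0 and forecast `steps` future values."""
    n = len(x0)
    # Accumulated series x1 and background values z (means of neighbours).
    x1, running = [], 0.0
    for v in x0:
        running += v
        x1.append(running)
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    y = list(x0[1:])
    m = len(z)
    # Least-squares fit of the grey equation y = -a * z + b.
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zi * yi for zi, yi in zip(z, y))
    slope = (m * szy - sz * sy) / (m * szz - sz * sz)
    a = -slope
    b = (sy - slope * sz) / m

    def x1_hat(k):
        # Time response of the accumulated series.
        return (x0[0] - b / a) * math.exp(-a * k) + b / a

    # Forecasts are differences of consecutive accumulated predictions.
    return [x1_hat(k) - x1_hat(k - 1) for k in range(n, n + steps)]
```

For a series that grows by a roughly constant factor, such as a delegation that increases by about 10% per Games, the forecast continues that trend closely.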

We again used the data from 2004 to 2020 as the training data for the TSRF-MV model and the 2024 data as the test data.

For the first-stage classifier, the confusion matrix and ROC curve on the test set are shown below. From them we compute a precision of 0.99, a recall close to 1, and an F1 score of 0.99. The classifier predicts very well and shows no sign of overfitting.

For the second-stage regressor, we trained and validated on total medal counts and gold medal counts separately, which gives the prediction curves and residual plots below.

For total medals, on the 2024 data the MSE is 7.9414, the MAE is 1.4186, and the RMSE is 2.8181. The model's average prediction error for a country's 2024 medal count is therefore about 1.4 medals. Relative to the average 2024 medal count of the top 20 countries on the medal table, this is an error ratio of about 0.04, which is acceptable, so the model predicts well.

For gold medals, on the 2024 data the MSE is 1.3318, the MAE is 0.6538, and the RMSE is 1.1541. The model's average prediction error for a country's 2024 gold medal count is therefore about 0.6 medals. Relative to the average 2024 gold medal count of the top 20 countries on the medal table, this is an error ratio of about 0.05, which is acceptable, so the model predicts well.

We therefore conclude that adding the Sport variable further improved the model without causing overfitting, and that the model performs very well. This completes the construction of the TSRF-MV model.

6. Forecast of future Olympic medal numbers and analysis of event importance

6.1 The 2028 Olympic medal table

6.1.1 Medal table forecast

We use the TSRF-MV model to predict each country's gold medal count and total medal count, training on the data of all countries from 2004 to 2024. The medal table is ranked by gold medals first; countries with the same number of gold medals are ordered by total medals. Because of the large number of countries and space limitations, we show only the top 20 countries. The medal data are shown in the figure below:

Using the t-distribution, we obtained prediction intervals for the medal counts at the 90% confidence level, visualized in the figure below.

6.1.2 Changes in countries' performance

To decide whether a country performs better or worse at the 2028 Olympics than in 2024, we use two indicators:

If the gold medal counts at the two Games differ, the country's performance has improved if its gold medal count increased and declined if it decreased.

If the gold medal counts at the two Games are the same, the country's performance has improved if its total medal count increased and declined if it decreased.
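The two indicators above amount to a lexicographic comparison, gold medals first and total medals second, as in the sketch below. The function name `performance_change` is hypothetical, and the "Unchanged" label for the case where both counts are equal is an assumption, since the text does not specify that case.

```python
def performance_change(gold_2024, total_2024, gold_2028, total_2028):
    """Classify a country's 2028 performance relative to 2024."""
    if gold_2028 != gold_2024:
        return "Better" if gold_2028 > gold_2024 else "Worse"
    if total_2028 != total_2024:
        return "Better" if total_2028 > total_2024 else "Worse"
    return "Unchanged"  # assumption: tie on both counts
```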

As before, we show the change in performance only for the top 20 countries. Comparing the predictions against the 2024 data gives the following chart:

The performance of these 20 countries is summarized in the table below:

Better

Worse

6.2 Estimation of countries that have not won any medals

Our prediction sample consists of the 94 countries that had not won an Olympic medal as of 2024, and the prediction target is the 2028 Games. Here we use the first stage of the TSRF-MV model, the random forest classifier, to predict whether each of these countries will win its first medal. We sort the countries in descending order of the classifier's predicted probability of winning a medal; the top ten are shown in the figure below. The highest predicted probability among these countries is 0.094, so the model predicts that none of them will win its first medal at the next Olympics.

6.3 Evaluating the importance of sport selection for countries

To further illustrate the relationship between the sports a country enters and its medal count, we collected each country's total number of medals in each individual sport (Sport) from 1992 to 2024, together with the number of events (event) held in that sport at the Olympics from 1992 to 2024. Based on these two indicators, we built a scoring model that measures how important each sport is to a given country. The model is the same entropy weight model described above, so its principle and formulas are not repeated here. We consider that the higher a sport's score, the greater its contribution to the medal count.

We selected three countries, USA, CHN, and AUS, as examples. The figure below shows part of the data, with each sport ranked by gold medals won, total medals won, and number of events:

We computed the score of every sport for these three countries; the top 10 sports for each are shown in the figure below. For each of these countries, the 10 highest-scoring sports contribute the most to its medal count.

7. Great Coach Effect Prediction and Quantification Model

In sports, a change of coach can have a significant impact on athletes' performance, especially when the new coach is a "great coach", and we expect this impact to be visible in changes in medal counts. This study uses mathematical modeling on Olympic medal data to predict and evaluate the "great coach" effect.

7.1 Construction of a Prediction and Quantification Model for the Great Coach Effect

7.1.1 Data processing

The data we analyze are extracted from the provided dataset: for each Olympic Games, the medal statistics of a country in a specific sport, including gold, silver, bronze, and total medals, together with whether the country was the host of that Games.

W_t (medal score) is the weighted number of medals a country won in a particular sport in year t (an Olympic year), which we use to characterize the country's performance in that sport. We assume that the medal types (gold, silver, bronze) contribute differently to a country's performance, so we assign each a different weight: 3 for gold, 1.5 for silver, and 1 for bronze. The weighted medal count W_t in year t is then:

$$W_t = 3 \cdot \mathrm{Gold}_t + 1.5 \cdot \mathrm{Silver}_t + 1 \cdot \mathrm{Bronze}_t$$

where Gold_t, Silver_t, and Bronze_t are the numbers of gold, silver, and bronze medals the country won in year t.

After processing, each record contains a time point (year), the country's medal score in that sport for that year, and whether the country was the host.
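As a small worked example of the medal score, the sketch below applies the weights 3, 1.5, and 1 stated above. The function name `medal_score` is hypothetical.

```python
def medal_score(gold, silver, bronze):
    """Weighted medal score W_t with weights 3 (gold), 1.5 (silver), 1 (bronze)."""
    return 3 * gold + 1.5 * silver + 1 * bronze
```

For instance, one gold, two silver, and three bronze medals give a score of 3 + 3 + 3 = 9.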

7.1.2 Sudden growth screening

We set the following screening criteria:

· Compared with the previous Games, the growth rate exceeds 200% or the number of medals increases by more than 2

· The country is not the host

· It is not the country's first medal in the sport

An event that meets all of these conditions is selected as a candidate sudden growth event.
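The screening criteria can be sketched as a single predicate. This is an illustration under stated assumptions: the function name `is_candidate_surge` is hypothetical, the growth rate is taken as (current − previous) / previous, and when the previous value is 0 the growth-rate test is skipped so that only the absolute-increase test applies.

```python
def is_candidate_surge(history, current, is_host):
    """Return True if this Games is a candidate sudden growth event.

    history: the country's earlier values in this sport, in time order.
    current: the value at the Games being screened.
    """
    # Exclude host nations and first-ever medals in the sport.
    if is_host or not history or not any(v > 0 for v in history):
        return False
    prev = history[-1]
    increase = current - prev
    # Growth rate above 200% (only defined when prev > 0) ...
    growth_ok = prev > 0 and increase / prev > 2.0
    # ... or an absolute increase of more than 2.
    return growth_ok or increase > 2
```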

To verify the detected events statistically, we apply a permutation test to each sport. Under the null hypothesis that medal growth is unrelated to time, we randomly shuffle the year order of the country's medal counts in that sport and compute the probability that an increase at least as large occurs by chance. If p < 0.05, we consider the candidate event significant.
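A permutation test of this kind can be sketched as below. The details are assumptions made for illustration, not the paper's exact procedure: the test statistic is the jump between two consecutive Games at a fixed position, the function name `permutation_p_value` and its parameters `n_perm` and `seed` are hypothetical, and the p-value uses the common add-one correction so that it is never exactly zero.

```python
import random

def permutation_p_value(series, index, n_perm=2000, seed=0):
    """P-value for the jump series[index] - series[index - 1].

    The series is shuffled n_perm times; the p-value is the share of
    shuffles whose jump at the same position is at least as large.
    """
    observed = series[index] - series[index - 1]
    rng = random.Random(seed)
    shuffled = list(series)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if shuffled[index] - shuffled[index - 1] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A flat series always yields a p-value of 1, while an isolated jump in an otherwise flat series yields a small p-value that shrinks as the series gets longer.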

We found a total of 1109 candidate sudden growth events, of which 108 are statistically significant (p < 0.05). In the regression analysis, the R² value is 0.932, close to 1, indicating a very good fit. The F-statistic is 1151 with a p-value of 0.0369, indicating that the model is significant overall.

7.1.3 Discussion on Regularity and Model Construction

We computed and ranked the growth rates of the screened sudden growth events and identified the 20 country-sport combinations with the highest growth rates, as shown in the figure below.

However, we found this ranking unreasonable, because most of the entries were sports such as Hockey and Handball. We therefore adjusted the growth rate according to the number of participants and the corresponding number of medals in each sport, which gave a new ranking chart:

Drawing on collected material and on the United States and China volleyball examples given in the problem statement, we see that many of the 20 sports with the highest medal growth rates owe their growth to coaching. The high growth of USA-Volleyball-2008 and CHN-Volleyball-2016 both occurred after Lang Ping took over as coach [11][12], and the growth of ESP-Football-2020 followed Spain's change of head coach in 2018 [13]. We therefore take the view that the higher the medal growth rate, the more likely the change is due to the great coach effect; that is, an extremely high medal growth rate in a sport can be treated as evidence of the great coach effect. On this basis, we build a country-specific model to predict in which sports the great coach effect may appear at the next Olympic Games and to quantify its impact.

To predict the great coach effect, we first forecast the number of medals the selected country will win in each sport at the next Olympic Games, using the grey prediction model GM(1,1).
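
For readers unfamiliar with GM(1,1), the sketch below shows the standard textbook form of the model in plain Python: accumulate the series, fit the development coefficient a and grey input b by least squares on the background values, and difference the fitted accumulated series to obtain forecasts. It is a generic illustration, not the authors' implementation; the function name `gm11_forecast` and the minimum length of four observations are assumptions.

```python
import math
from itertools import accumulate


def gm11_forecast(series, steps=1):
    """Forecast the next `steps` values of a non-negative series with GM(1,1)."""
    n = len(series)
    if n < 4:
        raise ValueError("GM(1,1) is usually applied to at least 4 observations")
    x1 = list(accumulate(series))                       # accumulated series
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]  # background values
    y = list(series[1:])
    # Least squares for y = a * u + b with u = -z (two-parameter closed form).
    u = [-v for v in z]
    m = len(u)
    su, sy = sum(u), sum(y)
    suu = sum(v * v for v in u)
    suy = sum(p * q for p, q in zip(u, y))
    denom = m * suu - su * su
    if denom == 0:
        raise ValueError("series is degenerate for GM(1,1)")
    a = (m * suy - su * sy) / denom
    b = (sy - a * su) / m

    def x1_hat(k):
        """Fitted accumulated value at index k (k = 0 is the first observation)."""
        if abs(a) < 1e-12:          # limiting case: no growth or decay
            return series[0] + b * k
        return (series[0] - b / a) * math.exp(-a * k) + b / a

    # Differencing the accumulated fit restores forecasts of the original series.
    return [x1_hat(k) - x1_hat(k - 1) for k in range(n, n + steps)]
```

A call such as `gm11_forecast(scores_by_games, steps=1)[0]` would give the forecast for the next Games, where `scores_by_games` is a hypothetical list of one country's past scores in one sport. Series that contain many zeros fit this model poorly, so some preprocessing may be needed in practice.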

To quantify the great coach effect, we simplify the calculation by ignoring other external factors and adopt the following standard: among all the sports of a given country, the medal growth rate of the sport with the highest growth rate is taken as the medal growth rate produced by the great coach effect.
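
This quantification rule reduces to taking a maximum over per-sport growth rates. The sketch below assumes the growth rate is measured from the most recent Games to the forecast for the next Games, and that sports with no medals at the most recent Games are skipped because their growth rate is undefined; neither detail is stated in the text. The function name `great_coach_effect` is hypothetical.

```python
def great_coach_effect(last_scores, forecast_scores):
    """Return (sport, growth_rate) for the sport with the highest growth rate.

    last_scores and forecast_scores map sport name to the country's medal
    value at the most recent Games and the forecast for the next Games.
    Returns None when no sport has a defined growth rate.
    """
    rates = {}
    for sport, last in last_scores.items():
        if last > 0 and sport in forecast_scores:
            rates[sport] = (forecast_scores[sport] - last) / last
    if not rates:
        return None
    best = max(rates, key=rates.get)
    return best, rates[best]
```

For instance, if a country's volleyball value is forecast to rise from 3 to 6 and its swimming value from 10 to 11, the rule attributes a 100% growth rate to the great coach effect and locates it in volleyball.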

7.2 Country Forecasts and Effect Quantification

Here, we select three countries: China, France, and Japan, and apply the above model to them.

8. Original Insights Demonstrated by the Model

9. Sensitivity Analysis

To examine how sensitive our model is to noisy data, we added noise to the training and test sets at increasing noise levels of 0.0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, and 0.50, and observed the effect on the random forest classifier and the random forest regressor. The classifier is evaluated by its AUC score and the regressor by its RMSE. Although the classifier's AUC dropped noticeably once the noise level exceeded 0.25, it stayed above 0.994 throughout, which indicates that our model is insensitive to noise and has strong noise resistance.
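
The structure of such a noise sweep can be sketched without reference to a specific model. The text does not say how the noise was generated, so the version below assumes zero-mean Gaussian noise whose standard deviation is the noise level times each feature's standard deviation. As a simplification, it perturbs only the evaluation data and scores it with a fixed function `score_fn`, which stands in for the fitted random forest classifier. All names here are hypothetical.

```python
import random
import statistics

NOISE_LEVELS = [0.0, 0.05, 0.10, 0.15, 0.20, 0.25,
                0.30, 0.35, 0.40, 0.45, 0.50]


def add_noise(rows, level, rng):
    """Add Gaussian noise with std = level * (per-feature standard deviation)."""
    stds = [statistics.pstdev(col) for col in zip(*rows)]
    return [[v + rng.gauss(0.0, level * s) for v, s in zip(row, stds)]
            for row in rows]


def auc(labels, scores):
    """AUC as the chance a random positive outranks a random negative.

    Ties count as one half. Labels are 1 (positive) or 0 (negative).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def noise_sweep(rows, labels, score_fn, levels=NOISE_LEVELS, seed=0):
    """Map each noise level to the AUC of score_fn on the perturbed rows."""
    rng = random.Random(seed)
    return {level: auc(labels, [score_fn(r) for r in add_noise(rows, level, rng)])
            for level in levels}
```

Plotting the returned AUC values against the noise levels gives the kind of curve discussed above. An analogous sweep for the regressor would replace `auc` with an RMSE function.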

10. Model Evaluation

11. Conclusion

References

[1] Rathke, A., & Woitek, U. (2007). Economics and Olympics: An efficiency analysis.

[2] Donald, W. B. (1972). Olympic games competition: structural correlates of national success. International Journal of Comparative Sociology, 13, 186.

[3] Condon, E. M., Golden, B. L., & Wasil, E. A. (1999). Predicting the success of nations at the Summer Olympics using neural networks. Computers & Operations Research, 26(13), 1243-1265.

[4] Tcha, M., & Pershin, V. (2003). Reconsidering performance at the Summer Olympics and revealed comparative advantage. Journal of Sports Economics, 4(3), 216-239.

[5] Forrest, D., McHale, I. G., Sanz, I., & Tena, J. D. (2015). Determinants of national medals totals at the summer Olympic Games: an analysis disaggregated by sport. In The economics of competitive sports (pp. 166-184). Edward Elgar Publishing.

[6] Bernard, A. B., & Busse, M. R. (2004). Who wins the Olympic Games: Economic resources and medal totals. Review of Economics and Statistics, 86(1), 413-417.

[7] Rewilak, J. (2021). The (non) determinants of Olympic success. Journal of Sports Economics, 22(5), 546-570.

[8] Scelles, N., Andreff, W., Bonnal, L., Andreff, M., & Favard, P. (2020). Forecasting national medal totals at the Summer Olympic Games reconsidered. Social Science Quarterly, 101(2), 697-711.

[9] Krishnan, S., Vasan, R. V., Varma, P., & Mala, T. (2022, December). Probabilistic Forecasting of the Winning IPL Team Using Supervised Machine Learning. In International Conference on Advanced Network Technologies and Intelligent Computing (pp. 138-152). Cham: Springer Nature Switzerland.

[10] Lessmann, S., Sung, M. C., & Johnson, J. E. (2010). Alternative methods of predicting competitive events: An application in horserace betting markets. International Journal of Forecasting, 26(3), 518-536.

[11] China Daily. (2005, February 8). China's economic growth in 2004 hits 9.5 percent. China Daily. https://www.chinadaily.com.cn/english/doc/2005-02/08/content_415939.htm

[12] China.org.cn. (2013, April 25). China's soccer coach hopeful about World Cup qualification. China.org.cn. http://www.china.org.cn/sports/2013-04/25/content_28655006.htm

[13] FIFA. (2018, July 9). Luis Enrique appointed Spain coach. FIFA.com. https://inside.fifa.com/en/news/luis-enrique-appointed-spain-coach