Introduction
Problem Background
During the 2024 Paris Olympic Games, audiences closely followed both the individual events and the national medal table. Countries at the top of the standings always attract the most attention, such as the United States, which led with 126 medals. Other medal-winning countries also deserve recognition, such as Albania and Cape Verde, which achieved historic breakthroughs. Notably, more than 60 countries have never won an Olympic medal. A country's medal count is closely related to its economic and social development, sports education policies, and population size, among other factors. Because predicting future medal tallies is of practical significance, we use mathematical models to analyze and address this problem.
Considering the background information and restricted conditions identified in the problem statement, we need to address the following issues:
Problem 1: Based on the model, predict the medal distribution for the 2028 Olympics and provide prediction intervals. Which countries are likely to win more medals, and which might perform worse? What is the probability that countries that have never won a medal will secure their first medal at the 2028 Olympics? Analyze the number and type of Olympic events and their relationship with medal counts. Which sports are most closely tied to particular countries, and how does the host's choice of events affect the distribution of medals?
Problem 2: How large is the impact of the "great coach" effect on medal counts? Select three countries, identify the sports in which they should invest in coaching, and estimate how their medal counts would change under the influence of such coaches.
Problem 3: What other factors might affect the distribution of the number of medals?
Our Work
Figure 1: Our Work
Assumptions and Justifications
Assumption 1: Athletes will not change their nationality.
Explanation: If an athlete has the potential to win a medal and they change their nationality, it would result in a decrease in the medal count for one country and an increase for another country.
Assumption 2: Historical data patterns remain stable.
Explanation: The patterns shown in the historical data will persist in the future; that is, the factors that influenced a country's first medal win in the past will still apply at the 2028 Olympics.
Assumption 3: High medal efficiency coupled with a high event participation rate reflects an elite coach strategy plus broad participation. High medal efficiency but a low event participation rate suggests a possible reliance on an elite strategy dominated by a few top coaches. Low medal efficiency indicates a lack of coaching resources or insufficient project investment.
Explanation: This approach simplifies the problem by directly linking the impact of coaches to the number of medals.
Notations
The key mathematical notations used in this paper are listed in Table 1.
Notations used in this paper
Symbol | Description |
$\mathbb{I}(\cdot)$ | Indicator function |
$Y$ | Year |
card | The cardinality of a set |
$\text{Athletes}_{\text{NOC},Y}$ | Number of athletes from country NOC in year $Y$ |
$F_0(x)$ | Initial prediction model output (starting value) |
$y_i$ | Target variable, the total number of medals of country $i$ |
c | Constant prediction value of the current model |
$L$ | Loss function, measuring the deviation of the model's predicted value from the true value |
N | Sample size, that is, the total number of countries participating in the Olympics |
$F(x_i)$ | Predicted value for the $i$-th country currently being processed by the model |
$\partial L / \partial F(x_i)$ | Partial derivative of the loss function with respect to the model's predicted value |
$r_{im}$ | Model's error in predicting the number of medals for a given country |
$h_m(x)$ | The $m$-th tree, a regression tree fitted to the residuals of the current model |
$F_m(x)$ | Current model's predicted value (including the cumulative contribution of all previous trees) |
$M$ | Total number of iterations, i.e., the total number of decision trees |
$\epsilon$ | Small constant that prevents the denominator from being zero |
Medals | Total medals |
Athletes Count | Number of participating athletes |
Event Count | Number of events |
EPR | Event participation rate |
$\varepsilon$ | Error term |
$\beta_j$ | Regression coefficient, indicating the marginal contribution of the feature to the number of medals |
Predicting Medal Counts for the 2028 Olympic Games
2028 Los Angeles Olympics medal table
Model selection
Predicting Olympic medal counts typically involves multiple interacting variables whose relationships are often nonlinear. Random Forest captures such nonlinear relationships by constructing many decision trees and aggregating their predictions. Medal prediction needs to consider factors such as a country's GDP, population, and distance from the host country, and Random Forest handles datasets with many features, whether continuous, categorical, or mixed. It also reduces the risk of overfitting by introducing randomness into tree building (for example, by randomly selecting subsets of features). Finally, it provides a feature importance assessment, which shows which factors matter most for predicting medal counts and supports a deeper analysis of the key drivers. We therefore adopt the Random Forest model to predict the medal counts for the 2028 Olympics.
Data processing
First, we filter and clean the data, checking for errors, outliers, and duplicate records and correcting or deleting them as needed. Large medal counts are not treated as outliers, because some countries genuinely win many more medals than others. For time series with missing values, mean filling or linear interpolation can be used to preserve the continuity of the data; following the data dictionary, we fill missing values directly with 0. We then test the stationarity of each time series with unit root tests (such as the ADF test) and the KPSS test. If a series is not stationary, it may need to be differenced or transformed into a stationary series to avoid spurious regression. Next, we select the core data from the cleaned dataset, using the year and athlete information as feature variables and the gold, silver, bronze, and total medal counts as target variables. Finally, we preprocess the feature variables: numerical features (such as the year and counts) are standardized, and categorical features (such as the country) are one-hot encoded.
Model establishment
Using the above data, we construct a random forest regression model with 100 trees, setting a random seed to ensure reproducibility. The preprocessing steps and the model are combined into a pipeline, and the features and target variables are split into training and test sets, with the test set accounting for 20%. Finally, the pipeline is trained on the training set to complete the random forest model.
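The construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact code behind the reported results: the data are synthetic, the column names (`NOC`, `Year`, `Athletes`, `Total`) are hypothetical stand-ins for the cleaned dataset, and only the total medal count is used as the target for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned dataset (hypothetical columns).
rng = np.random.default_rng(42)
n_rows = 200
data = pd.DataFrame({
    "NOC": rng.choice(["USA", "CHN", "JPN", "AUS", "FRA"], size=n_rows),
    "Year": rng.choice(np.arange(1896, 2025, 4), size=n_rows),
    "Athletes": rng.integers(5, 600, size=n_rows),
})
noise = rng.normal(0, 3, size=n_rows)
data["Total"] = (0.15 * data["Athletes"] + noise).clip(lower=0).round()

numeric_features = ["Year", "Athletes"]
categorical_features = ["NOC"]

# Standardize numerical features and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# Preprocessing and model combined into one pipeline: 100 trees, fixed seed.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])

X = data[numeric_features + categorical_features]
y = data["Total"]

# Hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline.fit(X_train, y_train)
test_predictions = pipeline.predict(X_test)
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training rows only, so no information from the test set leaks into the model.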
Model calculation
After the model is established, the prediction calculation begins. We first assemble the data for all countries from 1896 to 2024 and then use the trained pipeline to predict future Games, obtaining the predicted gold, silver, bronze, and total medal counts for each country. Table 1 lists the predicted 2028 medal counts for selected countries alongside their 2024 counts. Compared with the 2024 Olympic Games, the number of medals for countries such as Serbia in the 2028 Olympic Games is expected to increase by about 2, the United States may decrease by 6 medals, Japan may decrease by 4 medals, Australia and other countries may decrease by 2 to 3 medals, while South Korea, Spain, and other countries are expected to remain unchanged. Figure 3 gives the corresponding visual comparison of the top countries' gold, silver, bronze, and total medal counts for 2024 and 2028.
Table 1: Comparison of 2024 medal counts and 2028 predictions for gold, silver, and bronze medals (selected countries)
Country | 2024 Gold | 2028 Gold | 2024 Silver | 2028 Silver | 2024 Bronze | 2028 Bronze | 2024 Total | 2028 Total |
United States | 40 | 46 | 44 | 38 | 42 | 37 | 126 | 124 |
China | 40 | 36 | 27 | 29 | 24 | 28 | 91 | 94 |
Japan | 20 | 12 | 12 | 11 | 13 | 18 | 45 | 42 |
Australia | 18 | 17 | 19 | 15 | 16 | 18 | 53 | 51 |
France | 16 | 25 | 26 | 17 | 22 | 21 | 64 | 64 |
Netherlands | 15 | 11 | 7 | 10 | 12 | 11 | 34 | 33 |
Great Britain | 14 | 26 | 22 | 17 | 29 | 22 | 65 | 64 |
Figure 3: Comparison of medal counts for the top five countries in 2024 and 2028
The charts show that the United States ranks first in predicted gold, silver, and bronze medals, demonstrating its comprehensive advantage at the Olympics. China ranks second in all three, following the United States, which shows its strength as a sports powerhouse. Great Britain, France, and Australia perform stably in the medal predictions, ranking in the top five. Japan, Italy, the Netherlands, Germany, and South Korea are relatively close in their medal predictions, showing how competitive these countries are. In terms of distribution, gold medals are concentrated in a few countries (such as the United States and China), while silver and bronze medals are spread more evenly.
Figure 2: 2028 Olympic Gold, Silver, and Bronze Medal Top Ten Countries Forecast
First-medal probability for countries that have never won a medal
4.2.1 Data processing
The data consist of athlete participation records for the Summer Olympic Games from 1896 to 2024, with fields for athlete name (Name), gender (Sex), team (Team), country code (NOC), participation year (Year), host city (City), sport (Sport), specific event (Event), and medal (Medal). We clean the medal data by converting the Medal field into a binary variable: it is set to 1 if a gold, silver, or bronze medal was won and to 0 otherwise (including missing values and no medal). Records with missing values or format errors in the other fields are deleted to ensure the accuracy and integrity of the data. The cleaned data are grouped by country code (NOC) and year (Year). For each group we count the number of unique athletes (Athletes, the number of unique values in the Name field) and the number of sports entered (Sports, the number of unique values in the Sport field), and we record whether the country won a medal (Medal, the maximum of the Medal field within the group). For example, AFG in 1936 had 15 athletes in 1 sport and won no medal (Medal = 0); in 1948 it had 22 athletes in 2 sports and again won no medal.
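The cleaning and grouping steps can be sketched with pandas as below. The field names follow the dataset description above; the six sample rows are invented for illustration, and the text label "No medal" is an assumed example of a non-medal value.

```python
import pandas as pd

# Invented sample rows using the field names described in the text.
athletes = pd.DataFrame({
    "Name": ["A", "B", "B", "C", "D", "E", None],
    "NOC": ["AFG", "AFG", "AFG", "USA", "USA", "USA", "AFG"],
    "Year": [1936, 1936, 1936, 1936, 1936, 1948, 1936],
    "Sport": ["Hockey", "Hockey", "Hockey", "Athletics",
              "Swimming", "Athletics", "Hockey"],
    "Medal": [None, None, "No medal", "Gold", None, "Bronze", None],
})

# Drop records with missing identifying fields (Medal is handled separately).
athletes = athletes.dropna(subset=["Name", "NOC", "Year", "Sport"]).copy()

# Binary medal flag: 1 for gold/silver/bronze, 0 otherwise (including missing).
athletes["Medal"] = athletes["Medal"].isin(["Gold", "Silver", "Bronze"]).astype(int)

# One row per (NOC, Year): unique athletes, unique sports, any medal won.
summary = (
    athletes.groupby(["NOC", "Year"])
    .agg(
        Athletes=("Name", "nunique"),
        Sports=("Sport", "nunique"),
        Medal=("Medal", "max"),
    )
    .reset_index()
)
```

Taking the maximum of a 0/1 flag within each group is equivalent to asking whether at least one record in that group carries a medal.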
Model evaluation
We estimate the uncertainty of the Olympic medal predictions with a quantile regression random forest. Using a model of 500 decision trees, we derive the prediction interval from the quantiles (5th, 25th, 50th, 75th, 95th) of the individual trees' predictions, quantifying uncertainty in three ways: (1) Median prediction: the point estimate, reflecting the most likely medal count. (2) Interquartile range (IQR): the spread of the middle 50% of predicted values. (3) Prediction interval (PI): the range from the 5th to the 95th percentile, forming a 90% prediction interval.
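A minimal sketch of this per-tree quantile idea is given below, assuming scikit-learn's `RandomForestRegressor` and synthetic data. It takes quantiles over the 500 individual tree predictions, as the description above states; a full quantile regression forest would instead use the distribution of training targets in each leaf, so this should be read as an approximation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 5 * X[:, 0] + rng.normal(0, 2, size=300)

forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

X_new = X[:5]

# Collect each tree's prediction: array of shape (n_trees, n_samples).
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])

# Quantiles across trees give the median, the IQR, and the 90% interval.
q05, q25, q50, q75, q95 = np.percentile(per_tree, [5, 25, 50, 75, 95], axis=0)
iqr = q75 - q25
interval_width = q95 - q05
```

Coverage can then be checked on held-out data by counting how often the true value falls between `q05` and `q95`.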
1) Model evaluation
As shown in Table 3, the model demonstrates good predictive ability on the test set: (1) Prediction accuracy: for gold medals, the mean squared error (MSE) is 23.45 and the mean absolute error (MAE) is 3.78 medals. (2) Explanatory power: the coefficient of determination R² reaches 0.92, indicating that the model explains 92% of the variation in the number of medals. (3) Interval coverage: the overall prediction interval coverage rate is 91.23%, close to the nominal value of 90%.
Table 3: Performance Metrics for Medal Type Predictions at the 2028 Olympic Games
Medal Type | MSE | MAE | R² | Prediction Interval Width | Coverage Rate |
Gold | 23.45 | 3.78 | 0.92 | 13.9 | 89.47% |
Silver | 18.32 | 3.12 | 0.89 | 11.7 | 92.31% |
Bronze | 15.67 | 2.95 | 0.91 | 12.3 | 91.89% |
2) Prediction Interval Analysis
Prediction interval characteristics for a typical country (taking China as an example):
Gold medal prediction: the widest interval (13.9 medals), with a median forecast of 22.3 medals (IQR = 7.3) and a 90% prediction interval of [15.2, 29.1], reflecting the strong uncertainty of top-level competitive sports.
Silver medal prediction: the highest interval coverage rate (92.31%), with a median forecast of 18.9 medals; the prediction error (MAE = 3.12) is significantly lower than the historical fluctuation level.
Bronze medal prediction: an IQR of 6.3, with a right-skewed distribution (75th to 95th percentile range of 9.4 versus 5th to 25th percentile range of 2.6).
3) Discussion on Sources of Uncertainty
4.2.2 Feature engineering
Taking the target year $T$ as the prediction point, we extract historical data up to year $T-4$ (for example, 2024) and generate the following features:
1. Number of participations (Participations): the total number of Games in which the country competed up to the cutoff year, calculated as
$$\text{Participations}_{\text{NOC}} = \sum_{Y \le T-4} \mathbb{I}(\text{NOC competed in year } Y)$$
where $\mathbb{I}(\cdot)$ is the indicator function, equal to 1 if NOC competed in year $Y$ and 0 otherwise.
2. Total number of athletes (AthletesTotal): the total number of athletes from the country over all participating years up to the cutoff year, calculated as
$$\text{AthletesTotal}_{\text{NOC}} = \sum_{Y \le T-4} \text{Athletes}_{\text{NOC},Y}$$
where $\text{Athletes}_{\text{NOC},Y}$ is the number of athletes from country NOC in year $Y$.
3. Sports diversity (SportsDiversity): the number of different sports the country entered up to the cutoff year, calculated as the cardinality of the union of the sets of sports entered in each year, that is
$$\text{SportsDiversity}_{\text{NOC}} = \operatorname{card}\Big(\bigcup_{Y \le T-4} \text{Sports}_{\text{NOC},Y}\Big)$$
where card denotes the cardinality of a set, that is, the number of elements in the set.
4. Recent number of athletes (RecentAthletes): the number of athletes from the country in the cutoff year $T-4$, namely
$$\text{RecentAthletes}_{\text{NOC}} = \text{Athletes}_{\text{NOC},T-4}$$
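The four features can be computed from a per-country, per-year table as sketched below. The function name `build_features`, the `SportsSet` column (the set of sports entered in that year), and the sample rows are hypothetical and serve only to illustrate the definitions; the cutoff year is taken to be four years before the target year.

```python
import pandas as pd

# Hypothetical per-(NOC, Year) records; SportsSet holds the sports entered.
records = pd.DataFrame({
    "NOC": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "Year": [2016, 2020, 2024, 2024, 2028],
    "Athletes": [4, 6, 8, 3, 5],
    "SportsSet": [
        {"Athletics"}, {"Athletics", "Judo"}, {"Judo"},
        {"Swimming"}, {"Swimming"},
    ],
})


def build_features(records: pd.DataFrame, target_year: int) -> pd.DataFrame:
    """Build the four history features using data up to target_year - 4."""
    cutoff = target_year - 4
    history = records[records["Year"] <= cutoff]
    rows = []
    for noc, group in history.groupby("NOC"):
        recent = group.loc[group["Year"] == cutoff, "Athletes"]
        rows.append({
            "NOC": noc,
            # Number of Games entered up to the cutoff year.
            "Participations": group["Year"].nunique(),
            # Sum of athletes over all participating years.
            "AthletesTotal": int(group["Athletes"].sum()),
            # Cardinality of the union of the yearly sets of sports.
            "SportsDiversity": len(set().union(*group["SportsSet"])),
            # Athletes in the cutoff year itself (0 if absent that year).
            "RecentAthletes": int(recent.iloc[0]) if not recent.empty else 0,
        })
    return pd.DataFrame(rows)


features = build_features(records, target_year=2028)
```

Rows later than the cutoff year (the 2028 row for `BBB` here) are excluded, so the features never use information from the Games being predicted.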
4.2.3 Model Construction, Training, and Solution
First, the data are split and balanced. The training set contains only data from countries that have never won a medal (Medal History = 0). The ADASYN (Adaptive Synthetic Sampling Approach) algorithm is applied to oversample the training set: it synthesizes additional minority-class samples (countries winning for the first time), generating more of them for samples that are harder to learn. This alleviates the class imbalance and lets the model learn the features of the minority class better during training. A random forest classifier is then selected as the prediction model, with the following parameters: the number of decision trees is set to 200, balancing computational efficiency against model complexity; the class weights are set to 1 for the negative class and 5 for the positive class, so that the rare positive class carries more weight during training; and the maximum tree depth is set to 5 with a minimum of 10 samples per leaf node, which limits overfitting and improves generalization to different datasets. Next, the decision threshold is adjusted dynamically using the precision-recall curve on the test set, selecting the optimal threshold that satisfies recall ≥ 15% and precision ≥ 25%.
If no threshold meets these conditions, the 85th percentile of the predicted probabilities is used as the threshold. Adjusting the threshold dynamically improves predictive performance and gives a better balance when predicting first-time medals for countries that have never won. We then exclude blacklisted countries (such as SSD and LBN) from the predefined list of countries that have never won a medal (NON_MEDAL_COUNTRIES) and apply business rules, requiring Participations ≥ 1, RecentAthletes ≥ 6, and SportsDiversity ≥ 2, to select the candidate countries.
Extract the features defined in the feature engineering stage from the filtered candidate countries, input these features into the trained random forest model, and the model outputs the probability of each candidate country winning the first time.
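The dynamic threshold rule can be sketched as follows, using scikit-learn's `precision_recall_curve`. The function name `choose_threshold` is hypothetical, and because the text does not say how the "optimal" threshold is chosen among those meeting both constraints, this sketch assumes the one with the highest F1 score.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def choose_threshold(y_true, proba, min_recall=0.15, min_precision=0.25,
                     fallback_pct=85):
    """Pick a probability threshold meeting the recall/precision floors."""
    precision, recall, thresholds = precision_recall_curve(y_true, proba)
    # The final precision/recall pair has no matching threshold; drop it.
    precision, recall = precision[:-1], recall[:-1]
    eligible = (recall >= min_recall) & (precision >= min_precision)
    if not eligible.any():
        # Fallback: the 85th percentile of the predicted probabilities.
        return float(np.percentile(proba, fallback_pct))
    # Assumption: "optimal" means the highest F1 among eligible thresholds.
    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    best = int(np.argmax(np.where(eligible, f1, -np.inf)))
    return float(thresholds[best])


# Separable toy case: a valid threshold exists.
y_easy = np.array([0, 0, 0, 0, 1, 1])
p_easy = np.array([0.1, 0.2, 0.3, 0.4, 0.8, 0.9])
threshold_easy = choose_threshold(y_easy, p_easy)

# Hard toy case: the only positive has the lowest score, so no threshold
# reaches 25% precision and the percentile fallback is used.
y_hard = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
p_hard = np.arange(10) / 10
threshold_hard = choose_threshold(y_hard, p_hard)
```

Countries whose predicted probability is at or above the returned threshold would then be flagged as likely first-time medalists.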
With the model constructed and trained, we predict that the probability of winning a first medal at the 2028 Olympics exceeds 70% for the following countries: P(ANG) = 0.856079, P(NFL) = 0.855923, P(UNK) = 0.750016, P(PLE) = 0.709846, P(TUV) = 0.709559, P(GBS) = 0.706677.
Figure 4: First-medal probability for countries that have never won a medal
4.2.4 Model Evaluation
The model's AUC-ROC is 0.89, indicating strong discriminative ability: it distinguishes well between never-medaled countries that will win for the first time and those that will not. At the optimal threshold, precision is 32% and recall is 28%, reflecting the accuracy and completeness of the predicted first-time winners. In cross-validation the model's performance is stable, with a standard deviation below 0.03, indicating good robustness across different data subsets.
The relationship between project settings and the number of medals
Establishment of the model
We use the HistGradientBoostingRegressor model to quantify the relationship between the sports program and the medal distribution. Gradient boosting is an ensemble learning method based on decision trees, whose core idea is to improve predictive performance by fitting the remaining error step by step.
After $M$ iterations, the final model is:
$$F_M(x) = F_0(x) + \sum_{m=1}^{M} h_m(x)$$
where $F_M(x)$ is the final prediction model, including the cumulative contribution of all $M$ trees, $F_0(x)$ is the initial prediction, $h_m(x)$ is the contribution of the $m$-th tree, and $M$ is the total number of iterations, i.e., the total number of decision trees.
The final model is an additive combination of all weak predictors (trees) and can capture complex relationships between the features and the target variable (number of medals). For example, it can simultaneously account for the host country effect (whether the country is hosting) and the number and type of sports events (such as track and field or swimming), together with their impact on the distribution of medals.
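The idea of fitting the error step by step can be illustrated with a hand-written boosting loop for squared-error loss. This is a teaching sketch on synthetic data, not the histogram-based implementation used in the paper; the learning rate and tree depth are assumed values, and each tree's scaled output plays the role of one tree's contribution to the final model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 2))
y = 10 * X[:, 0] + 5 * (X[:, 1] > 0.5) + rng.normal(0, 0.5, size=200)

M = 50                # total number of iterations (trees)
learning_rate = 0.1   # assumed shrinkage factor, not stated in the text

# Initial model: the constant that minimizes squared error is the mean.
F0 = y.mean()
prediction = np.full_like(y, F0)
trees = []

for m in range(M):
    # For squared-error loss the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=3, random_state=m).fit(X, residuals)
    # Add this tree's (shrunken) contribution to the running model.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

initial_mse = float(np.mean((y - F0) ** 2))
final_mse = float(np.mean((y - prediction) ** 2))
```

Comparing `initial_mse` with `final_mse` shows how much of the training error the accumulated trees remove.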
Introduce feature variables
Is Host: the host country indicator, used to measure home advantage.
Host Chosen Sports: the number of sports chosen by the host country, reflecting the host's ability to influence medal counts by adding new events.
Medal Efficiency: the ratio of the number of medals to the number of events, used to measure a country's training and selection efficiency.
Events: the total number of events, used to describe the overall distribution of medal opportunities.
Individual sport features (such as athletics and swimming): measure the contribution of a country's performance in a specific sport to its total number of medals.
Total: the total number of medals won by each country at a particular Olympic Games.
Based on the optimized results, organize the feature importance into a table and sort it by importance
Table 4: Feature Names and Their Importance
Feature name | Feature Importance |
Is Host | 0.467212 |
Host Chosen Sports | 0.267642 |
Medal Efficiency | 0.185143 |
Events | 0.045841 |
Athletics | 0.016510 |
Swimming | 0.014768 |
Gymnastics | 0.001607 |
Basketball | 0.000673 |
Football | 0.000604 |
Model solution and answer
According to the data in Table 4, the ranking of feature importance is:
Is Host (host country): the highest feature importance, about 46.7%. The host country effect is the most critical factor affecting the total number of medals, which may be due to: Home advantage: athletes are familiar with the venues and are supported by the spectators, and the host has a significant advantage in its strong events. Judging bias: in subjectively scored events (such as gymnastics and diving), the host country's athletes often have an advantage.
Host Chosen Sports (number of sports chosen by the host country): the second most important feature, accounting for 26.8%. The host country can increase its medal count by adding or promoting sports events. Typically, the host tends to choose events in which it excels, which significantly boosts its medal tally.
Medal Efficiency: ranked third, with a feature importance of 18.5%. This feature reflects how efficiently a country's athletes convert participation into medals (the ratio of medal count to the number of events entered). Countries with higher medal efficiency usually concentrate their resources on a few advantageous events.
Events (number of events): the feature importance is only 4.6%. The total number of events has some impact on the total number of medals, but the effect is relatively indirect and may only become significant in combination with other features (such as the host country or the number of athletes).
Single sports (such as track and field and swimming): the importance of any single sport is relatively low (less than 2%), which may be because these sports contribute evenly to the total number of medals.
Regarding the issue of "the impact of sports events on medal distribution," from the perspective of important sports events: Although the importance of individual events is relatively low, sports with high medal density such as swimming and track and field (respectively about 35 and 48 events) still make a significant basic contribution to the total number of medals. For example, in track and field, almost all countries will invest resources in this field because the number of medals is the highest. In swimming, the distribution of medals is relatively wide, but some developed countries (such as the United States and Australia) have significant advantages. In the gymnastics event, although the number of medals is small, the subjective scoring of judges is common, and the host country's athletes often have an advantage. From the perspective of the number of sports events, the more sports events there are, the greater the opportunity for countries to win medals. The host country often promotes the addition of new events, and this strategy has a significant impact on the distribution of medal numbers. For example, in the 2000 Sydney Olympics, women's weightlifting was added, and Australia performed outstandingly in this event.
Regarding the question of "which sports events are most important for various countries," we have conducted a focused analysis of China, the United States, and Japan, and have drawn the following conclusions: The United States: The United States has long dominated in high-medal density events such as track and field and swimming, thanks to its strong sports infrastructure and extensive training system. China: China has a high gold medal efficiency in a few sports such as diving, table tennis, and gymnastics. These projects have concentrated resources, and the athletes have maintained a leading level of technical skill for a long time. Japan: Japan has a historical advantage in traditional sports such as judo and increased its total medal count in the 2020 home Olympics by adding the sport of karate.
Regarding the question of how the host country's choice of events affects the results, we found that the host country raises its medal share in its strong events by adding new events. For example, at the 2020 Tokyo Olympics Japan performed strongly in newly added sports such as sport climbing, as well as in judo, significantly increasing its medal count with the help of home advantage. The host country also has a natural advantage in judged events (such as gymnastics and diving), and such events contribute greatly to the host country's medal count.
Impact of the Great Coach Effect on Medal Counts
Establishment of the model
Dual indicator method
We first adopt the dual indicator method, analyzing medal efficiency (ME) and event participation rate (EPR). We assume that high medal efficiency combined with a high event participation rate reflects an elite coach strategy plus broad participation, that high medal efficiency with a low event participation rate may indicate reliance on an elite strategy dominated by a few top coaches, and that low medal efficiency is due to a lack of coaching resources or insufficient project investment.
1. Medal efficiency calculation formula:
$$\text{ME} = \frac{\text{Medals}}{\text{Athletes Count} + \epsilon}$$
where Medals is the number of medals won by a country in a given sport, Athletes Count is the number of participating athletes, and $\epsilon$ is a small constant that prevents the denominator from being zero.
2. Event participation rate: EPR is computed relative to Event Count, the total number of events in the corresponding sport, with $\epsilon$ again preventing a zero denominator.
Multivariate regression model
We establish a regression model to quantify the marginal contribution of the coaching effect. The multivariate linear regression model is as follows:
$$\text{Medals} = \beta_0 + \beta_1 \cdot \text{Athletes Count} + \beta_2 \cdot \text{Event Count} + \beta_3 \cdot \text{EPR} + \varepsilon$$
where Medals is the total number of medals, Athletes Count is the number of participating athletes, Event Count is the number of events, EPR is the event participation rate, $\beta_0, \dots, \beta_3$ are the regression coefficients, indicating the marginal contribution of each feature to the number of medals, and $\varepsilon$ is the residual term. From the coefficient $\beta_3$ we can judge the impact of the event participation rate on the number of medals. Marginal contribution: $\beta_3$ is the change in the number of medals when the event participation rate increases by one unit.
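A regression of this form can be fitted by ordinary least squares as sketched below. The data are synthetic and the generating coefficients (0.16 for the athlete count, 0 for the participation rate) are assumptions chosen only so the recovered values are easy to check; they are not the paper's estimates.

```python
import numpy as np

# Synthetic data with known generating coefficients (illustrative only).
rng = np.random.default_rng(3)
n = 120
athletes_count = rng.integers(1, 50, size=n).astype(float)
event_count = rng.integers(1, 20, size=n).astype(float)
epr = rng.uniform(0, 100, size=n)
medals = (0.5 + 0.16 * athletes_count + 0.05 * event_count
          + rng.normal(0, 0.3, size=n))

# Design matrix with an intercept column followed by the three regressors.
design = np.column_stack([np.ones(n), athletes_count, event_count, epr])

# Least-squares solution of design @ beta = medals.
beta, *_ = np.linalg.lstsq(design, medals, rcond=None)
beta0, beta1, beta2, beta3 = beta

# Marginal effect of raising EPR by 10 points, holding the rest fixed.
effect_of_10_points = 10 * beta3
```

Each coefficient is read as the expected change in medals for a one-unit change in that feature with the others held fixed, which is the marginal contribution discussed above.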
Model solution
We take the United States, Japan, and Cuba as the research objects. Combining the dual indicator method with the multiple linear regression model, the coefficient of Athletes Count is 0.1571, meaning that each additional participating athlete raises the average number of medals by about 0.157. The coefficient of Event Participation Rate is -0.0009, meaning that each 1% increase in the event participation rate slightly decreases the number of medals. This suggests that highly elite sports may depend more on excellent coaches than on broad participation, a pattern that is particularly evident in high-efficiency sports. Using data from sports and countries with high medal efficiency, the contribution of the coach effect can be estimated indirectly:
(1) USA basketball: a medal efficiency of 0.999999434 indicates near-perfect efficiency. The high event participation rate of 51.74% indicates a combination of widespread participation and elite coaching. If the event participation rate increases by 10% and efficiency remains unchanged, the regression coefficients imply an increase of approximately 0.81 medals.
(2) Japanese baseball and softball: the medal efficiency is close to perfect, but the event participation rate is slightly lower than that of the United States. Introducing more top coaches would help maintain this high medal efficiency.
(3) Cuban baseball: the event participation rate is only 22.40%. If it is raised to 30% by increasing coaching resources, the additional medals can be estimated from the coefficient.
Combining the data from High Efficiency Projects and the results of the regression model, here are our investment recommendations for three countries:
(1) United States (USA): Enhance coaching support for basketball and baseball
Reason: The United States has already achieved high medal efficiency and a high participation rate in basketball and baseball. Relying on training by elite coaches can further consolidate its leading position in these sports.
Recommendation: Invest in higher-level training facilities and coach education to strengthen long-term competitiveness.
Estimated impact: On the basis of the current high efficiency, increasing the event participation rate by 5% to 10% is expected to add 1 to 2 medals.
(2) Japan (JPN): Expand participation in baseball and softball events
Reason: Japan's baseball and softball teams are among the strongest internationally, but the event participation rate is relatively low compared with the United States. With more investment and more high-level coaches, the existing advantages can be fully exploited.
Recommendation: Expand the domestic baseball competition system and attract excellent foreign coaches.
Estimated impact: If the participation rate increases by 10%, the regression coefficients imply an increase of about 0.5 to 1 medal.
(3) Cuba (CUB): Expand the elite baseball program
Reason: Cuba's baseball team performs at an extremely high efficiency, but the event participation rate is low, possibly because it relies on a small number of top players. Increasing the investment in coaching resources can expand the number of participants and enhance overall competitiveness.
Recommendation: Introduce a foreign "great coach" for the baseball program and strengthen the domestic training system.
Estimated impact: Increasing the participation rate from 22.40% to 30% is expected to add 1 medal.
The effect of athletes' age on the number of medals
Data processing
First, collect relevant data containing age and total number of medals. The dataset contains a certain number of samples, reflecting the total number of medals corresponding to different ages. Check the integrity of the data to ensure there are no missing values. If there are missing values, use appropriate methods to fill them, such as mean filling method, median filling method, etc. Then, identify and handle possible outliers. Observe the distribution of data through methods such as box plots, scatter plots, etc., and correct or remove data points that deviate significantly from the normal range according to specific situations.
Model establishment and resolution
We first clean and filter the data, then analyze the data to obtain a stacked bar chart - Figure 5.
Figure 5: Relationship between the number of medals and the age of athletes
After importing the data, we fit it with a Fourier series model. To balance accuracy against the risk of overfitting, we adopt the standard four-term model Fourier4, whose formula is:
$$f(x) = a_0 + \sum_{k=1}^{4}\left[a_k\cos(k w x) + b_k\sin(k w x)\right]$$
First, the least squares method is used to estimate the model parameters $a_0$, $a_k$, $b_k$ ($k = 1, \dots, 4$) and $w$. The goal of the least squares method is to minimize the sum of squared errors between the actual observed values $y_i$ (the actual total number of medals) and the model predictions $f(x_i)$, that is:
$$\min_{a_0,\, a_k,\, b_k,\, w} SSE = \sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2$$
where $n$ is the number of samples, $x_i$ is the age value of the $i$-th sample, and $y_i$ is the actual total number of medals of the $i$-th sample.
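A minimal sketch of this least-squares fit, assuming the standard Fourier4 form f(x) = a0 + sum over k = 1..4 of [a_k cos(k w x) + b_k sin(k w x)] and that NumPy is available. Instead of gradient descent or a trust-region solver, the sketch scans candidate values of w and, for each one, solves the remaining linear least-squares problem exactly. This is a substitute technique chosen for brevity, not the solver used to produce the results in this paper. The function names, the grid for w, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def fourier4_design(x, w):
    """Design matrix with columns 1, cos(k w x), sin(k w x) for k = 1..4."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)]
    for k in range(1, 5):
        cols.append(np.cos(k * w * x))
        cols.append(np.sin(k * w * x))
    return np.column_stack(cols)

def fit_fourier4(x, y, w_grid):
    """Scan w; for each w solve linear least squares for the coefficients.

    Returns (best_w, coefficients, sse), with coefficients ordered
    a0, a1, b1, a2, b2, a3, b3, a4, b4.
    """
    y = np.asarray(y, dtype=float)
    best = None
    for w in w_grid:
        design = fourier4_design(x, w)
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((y - design @ coef) ** 2))
        if best is None or sse < best[2]:
            best = (float(w), coef, sse)
    return best

# Synthetic data (not the paper's dataset): a smooth periodic curve plus noise.
rng = np.random.default_rng(0)
ages = np.arange(10, 98, dtype=float)
clean = 100 + 60 * np.sin(0.08 * ages) - 30 * np.cos(0.16 * ages)
medals = clean + rng.normal(scale=2.0, size=ages.size)

w_hat, coef_hat, sse_hat = fit_fourier4(ages, medals, np.linspace(0.05, 0.12, 141))
```

Because the model is linear in every parameter except w, fixing w turns each step into an ordinary linear regression, which avoids the need for starting values for the nine coefficients.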
Specific estimation process
A numerical optimization algorithm (such as gradient descent or a trust-region algorithm) is used to find the parameter values that minimize the sum of squared errors $SSE$. Initial parameter values are supplied (set from experience or preliminary analysis) and then adjusted iteratively until $SSE$ reaches a minimum or a convergence condition is satisfied. After obtaining the estimated parameter values, we calculate a 95% confidence interval for each parameter, which reflects the degree of uncertainty in the estimate. The interval is usually calculated from the covariance matrix of the parameter estimates. For a parameter $\theta$ (here $a_0$, $a_1$, $b_1$, and so on), the 95% confidence interval is
$$\hat{\theta} \pm z_{\alpha/2}\, SE(\hat{\theta})$$
where $\hat{\theta}$ is the estimated value of the parameter, $z_{\alpha/2}$ is the quantile of the standard normal distribution ($z_{\alpha/2} \approx 1.96$ for a 95% confidence interval), and $SE(\hat{\theta})$ is the standard error of the parameter estimate. The final estimated parameter values and confidence intervals are as follows.
a0 = 157.5 (137.8, 177.2), a1 = -106 (-138.4, -73.57), b1 = 261.3 (236.6, 285.9),
a2 = -171.4 (-201.5, -141.3), b2 = -154.7 (-184, -125.5),
a3 = 150.8 (122.1, 179.6), b3 = -109.3 (-141.5, -77.08),
a4 = 64.08 (33.14, 95.03), b4 = 108.9 (78.69, 139.1),
w = 0.07627 (0.07503, 0.07751)
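As a quick check on the reported coefficients, the sketch below evaluates the fitted curve and locates its peak on the 10 to 97 age range. It assumes the standard Fourier4 form f(x) = a0 + sum over k = 1..4 of [a_k cos(k w x) + b_k sin(k w x)]. It also backs out the standard error implied by the reported interval for a0, assuming a symmetric normal-theory interval (half-width = 1.96 x SE). The variable and function names are illustrative.

```python
import math

# Point estimates as reported above.
A0 = 157.5
A = [-106.0, -171.4, 150.8, 64.08]   # a1..a4 (cosine terms)
B = [261.3, -154.7, -109.3, 108.9]   # b1..b4 (sine terms)
W = 0.07627

def fourier4(age):
    """Evaluate the fitted Fourier4 curve at a given age."""
    return A0 + sum(
        a * math.cos(k * W * age) + b * math.sin(k * W * age)
        for k, (a, b) in enumerate(zip(A, B), start=1)
    )

# Integer age at which the fitted curve is largest on the 10..97 range.
peak_age = max(range(10, 98), key=fourier4)

# Standard error implied by the reported 95% interval for a0,
# assuming the interval is estimate +/- 1.96 * SE.
a0_low, a0_high = 137.8, 177.2
a0_se = (a0_high - a0_low) / 2 / 1.96
```

The fitted curve is periodic with period 2*pi/w, roughly 82 years, so it should only be read inside the age range covered by the data.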
Using the fitted Fourier series model, we obtain the curve shown in Figure 6. It can be seen that before age 25 the number of medals increases with age, and after age 25 it decreases with age. Most medalists are in the 15 to 35 age group, with 25-year-olds accounting for the highest proportion of winners. The number of winners decreases significantly after age 40, and after age 50 almost no athletes have won medals.
Figure 6: Fourier Series Fitting Graph
Model evaluation
We adopt various methods to evaluate the model:
1. Sum of Squared Errors (SSE):
$$SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
It measures the overall error between the model's predicted values $\hat{y}_i$ and the actual observed values $y_i$. The smaller the SSE, the better the model fits the data.
2. Coefficient of determination ($R^2$):
$$R^2 = 1 - \frac{SSE}{SST}, \qquad SST = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
where $\bar{y}$ is the average of the total number of medals. $R^2$ ranges from 0 to 1; the closer it is to 1, the stronger the model's ability to explain the data and the better the fit.
3. Adjusted R-squared ($R^2_{adj}$):
It takes the number of parameters in the model into account, to avoid overfitting caused by adding too many parameters. The calculation formula is
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
where $n$ is the number of samples and $p$ is the number of model parameters.
4. Root Mean Square Error (RMSE):
$$RMSE = \sqrt{\frac{SSE}{n}}$$
It is the square root of the mean of the SSE and intuitively reflects the average size of the error between the predicted values and the actual values.
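The four indicators can be computed together, as in the minimal sketch below. It assumes the common textbook conventions: adjusted R-squared with n - p - 1 in the denominator and RMSE as the square root of SSE / n. Some fitting tools use n - p in both places, so values may differ slightly from theirs. The function name and toy values are illustrative and are not taken from the paper's data.

```python
import math

def fit_metrics(actual, predicted, n_params):
    """Return SSE, R^2, adjusted R^2 and RMSE for one fitted model."""
    n = len(actual)
    mean_y = sum(actual) / n
    sse = sum((y - f) ** 2 for y, f in zip(actual, predicted))
    sst = sum((y - mean_y) ** 2 for y in actual)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_params - 1)
    rmse = math.sqrt(sse / n)
    return {"sse": sse, "r2": r2, "adj_r2": adj_r2, "rmse": rmse}

# Hypothetical toy values for illustration only.
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [12.0, 18.0, 31.0, 39.0, 50.0]
metrics = fit_metrics(actual, predicted, n_params=2)
```

For this toy input the SSE is 10 and the total sum of squares is 1000, so R-squared is 0.99 and the adjusted value is 0.98.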
By calculation, we get:
,,,
From these evaluation indicators, $R^2$ and adjusted $R^2$ are both high, indicating that the model explains most of the variation in the total number of medals and fits the data well. The root mean square error shows that the average error between the predicted and actual values stays within a limited range. Overall, this Fourier series model describes the relationship between age and the total number of medals well.
Sensitivity Analysis and Error Analysis
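We perform a sensitivity analysis on the model for the first question:
1. Medal Efficiency perturbation: the output is -1.938. When the value of Medal Efficiency is perturbed (scaled up by 20%), the predicted total number of medals changes markedly, decreasing by about 1.94 medals. This shows that medal efficiency has an important influence on the prediction of the total number of medals. If medal efficiency improves (for example, athletes perform better in competition), a country's medal count may increase substantially. In other words, medal efficiency is a key factor in predicting the total number of medals and is probably closely tied to the quality of athletes' performance and training.
2. Event Participation Rate perturbation: the output is 0.0. When the value of Event Participation Rate is perturbed, the predicted total number of medals does not change. This shows that the feature contributes little to the prediction in the current model. Even increasing participation in an event does not directly change the predicted medal count. This may mean that participation is not strongly correlated with the number of medals, or that other factors (such as athletes' overall strength and medal efficiency) matter more.
3. Athletes Count perturbation: the output is 0.0. Likewise, when the value of Athletes Count is perturbed, the predicted total number of medals does not change, so the number of athletes has little influence on the prediction. Although intuitively more athletes might bring more medal opportunities, in the current model the number of athletes does not significantly affect the total number of medals. This may be because the model relies mainly on features such as medal efficiency rather than on the number of participants alone.
4. Total Events perturbation: the output is 0.00093. When the value of Total Events is perturbed, the predicted total number of medals changes slightly, increasing by about 0.00093 medals. The number of events therefore has a very small influence on the prediction. Although more events may give national teams more medal opportunities, in this model the number of events does not significantly affect the medal count, possibly because medal gains depend more on the quality and efficiency of the participating athletes than on the number of events alone.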
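A perturbation test of this kind can be sketched as follows. The helper scales one feature by 20% and reports the mean change in prediction. The predictor here is a hypothetical piecewise-constant stand-in for a fitted tree model, with made-up thresholds and feature names. It is included because tree ensembles predict in steps, which is one possible reason a perturbation can return exactly 0.0: the perturbed value may not cross any split threshold.

```python
def perturbation_effect(predict, rows, feature, scale=1.2):
    """Mean change in prediction when one feature is multiplied by `scale`."""
    deltas = []
    for row in rows:
        perturbed = dict(row)
        perturbed[feature] = row[feature] * scale
        deltas.append(predict(perturbed) - predict(row))
    return sum(deltas) / len(deltas)

def toy_tree_predict(row):
    """Hypothetical stand-in for a fitted tree model (piecewise constant)."""
    base = 10.0 if row["medal_efficiency"] > 0.30 else 4.0
    bonus = 2.0 if row["athletes_count"] > 500 else 0.0
    return base + bonus

# Made-up feature rows for illustration only.
rows = [
    {"medal_efficiency": 0.28, "athletes_count": 120},
    {"medal_efficiency": 0.10, "athletes_count": 60},
]

eff_effect = perturbation_effect(toy_tree_predict, rows, "medal_efficiency")
count_effect = perturbation_effect(toy_tree_predict, rows, "athletes_count")
```

In this toy case only the first row crosses the efficiency threshold, so the efficiency effect is nonzero while the athlete-count effect is exactly zero. A zero result therefore shows local insensitivity at the tested perturbation size, not necessarily that the feature is unimportant.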
We then analyze the model for the third question. For each parameter, we select several values (for example, 5 values) at equal intervals within its 95% confidence interval to represent the variation of the parameter. For each of these values, we substitute it into the Fourier series model and calculate the model output over the given age range (assumed to be the interval from the minimum age to the maximum age in the dataset, for example from 10 to 97). We then compare this with the model output at the estimated parameter values and calculate the relative change rate. The formula for the relative change rate is:
$$\text{Relative change rate} = \left|\frac{y_{var} - y_{est}}{y_{est}}\right| \times 100\%$$
where $y_{var}$ is the model output when the parameter takes a varied value and $y_{est}$ is the model output when the parameter takes its estimated value.
The average relative change rates of the parameters are as follows:
Parameter | Average relative change rate |
a1 | 110.04% |
b1 | 45.15% |
a2 | 91.11% |
b2 | 62.30% |
a3 | 97.97% |
b3 | 61.51% |
a4 | 101.52% |
b4 | 58.65% |
w | 80.09% |
From these results, we can infer that parameters a1 and a4 have relatively high average relative change rates, so the model output (the predicted total number of medals) is relatively sensitive to changes in these two parameters. The average relative change rate of b1 is relatively low, indicating that its impact on the model output is relatively small. In practical applications, improving the accuracy of the model's predictions may require paying more attention to the parameters with higher average relative change rates, ensuring that their estimates are as accurate as possible.
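The procedure can be sketched as below, using the reported estimates and 95% intervals and the standard Fourier4 form. Several details are assumptions of the sketch: the relative change is taken in absolute value, it is averaged over integer ages 10 to 97 and over 5 equally spaced parameter values, and ages where the baseline output is close to zero are skipped to avoid dividing by a near-zero value. Because the paper's exact averaging scheme is not stated, the sketch is not expected to reproduce the tabulated percentages.

```python
import math

# Reported estimates with 95% intervals: name -> (estimate, low, high)
PARAMS = {
    "a0": (157.5, 137.8, 177.2),
    "a1": (-106.0, -138.4, -73.57), "b1": (261.3, 236.6, 285.9),
    "a2": (-171.4, -201.5, -141.3), "b2": (-154.7, -184.0, -125.5),
    "a3": (150.8, 122.1, 179.6), "b3": (-109.3, -141.5, -77.08),
    "a4": (64.08, 33.14, 95.03), "b4": (108.9, 78.69, 139.1),
    "w": (0.07627, 0.07503, 0.07751),
}
AGES = range(10, 98)

def fourier4(age, p):
    """Evaluate the Fourier4 model for a parameter dictionary p."""
    total = p["a0"]
    for k in range(1, 5):
        total += p[f"a{k}"] * math.cos(k * p["w"] * age)
        total += p[f"b{k}"] * math.sin(k * p["w"] * age)
    return total

def mean_relative_change(name, n_values=5, min_abs=1.0):
    """Average |y_var - y_est| / |y_est| over ages and parameter values.

    Ages with |y_est| < min_abs are skipped (an assumption of this sketch).
    """
    base = {k: v[0] for k, v in PARAMS.items()}
    _, low, high = PARAMS[name]
    changes = []
    for i in range(n_values):
        varied = dict(base)
        varied[name] = low + (high - low) * i / (n_values - 1)
        for age in AGES:
            y_est = fourier4(age, base)
            if abs(y_est) < min_abs:
                continue
            changes.append(abs(fourier4(age, varied) - y_est) / abs(y_est))
    return sum(changes) / len(changes)

# Same parameters as in the table above (a0 is not listed there).
sensitivity = {name: mean_relative_change(name) for name in PARAMS if name != "a0"}
```

Multiplying each value in `sensitivity` by 100 gives a percentage comparable in form to the table.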
7 Model Evaluation
Strengths
(1) Question 1: We adopted the random forest model, which handles the complex nonlinear relationship between input features and target variables well. In the medal prediction problem, the influencing factors (such as medal efficiency, number of athletes, home effect, etc.) are not in a simple linear relationship with the number of medals, and the random forest can flexibly model these complex relationships.
(2) Random forest can also handle high-dimensional features and has strong robustness against noisy data and outliers. Even with irrelevant features, random forest can still effectively screen out useful features.
(3) Question 3: We adopted the Fourier series model, which fits various nonlinear relationships well and handles complex data patterns. There is a nonlinear relationship between the age of athletes and the number of medals they win, and the Fourier series can describe this complex relationship well.
(4) Fourier series can also filter out high-frequency noise, focusing on the main data trends and patterns, which is particularly important for processing Olympic medal data affected by various external factors.
Weaknesses
(1) Regarding question one, the random forest model has a weak ability to model time series features and cannot directly capture the dynamic changes of the features.
(2) Regarding question three, the Fourier series model may overfit the noise when fitting data, leading to a decrease in the model's generalization ability on new data. To avoid overfitting, appropriate regularization of the model is required.
Conclusion
For the first question, we adopted the random forest model and gradient boosting decision tree to predict the number of medals for the 2028 Olympic Games. We found that China, Serbia, and other countries are expected to increase their medal count by about 2 medals, the United States may decrease by 6 medals, Japan may decrease by 4 medals, Australia and other countries may decrease by 2 to 3 medals, while South Korea, Spain, and other countries are expected to maintain the same number of medals. Palestine, Angola, and other countries that have never won Olympic medals have a probability of more than 70% of winning medals at the 2028 Olympics. Regarding the issue of the impact of Olympic events on the number of medals, although the importance of individual events is relatively low, the basic contribution of high-medal density events such as swimming and track and field to the total number of medals is still significant. The United States has long dominated high-medal density events such as track and field and swimming. Japan has a historical advantage in traditional events such as judo. China has a high gold medal efficiency in a few events such as diving, table tennis, and gymnastics.
Regarding the second question, we analyzed the great coach effect using a dual indicator method and a multivariate linear regression model. We recommend that the United States strengthen coaching support for basketball and baseball, that Japan expand participation in baseball events, and that Cuba expand its elite baseball program.
For the third question, we analyzed the impact of athletes' ages on the number of medals using the Fourier series model. The conclusion is: most medal-winning athletes are in the age range of 15 to 35, among whom those around 25 years old account for the largest proportion of winners, and the number of winners over 40 years old decreases significantly.
References
[1] Wu, D. T., & Wu, Y. (2008). The possibility of China surpassing the United States in gold medals at the 2008 Beijing Olympics: Analysis and prediction based on the host effect. Statistical Research, (03), 60-64. https://doi.org/10.19343/j.cnki.11-1302/c.2008.03.012
[2] Wang, F. (2019). Prediction of medal performance at the 2020 Olympics based on neural networks. Statistics and Decision, 35(05), 89-91. https://doi.org/10.13546/j.cnki.tjyjc.2019.05.019
[3] Lin, Y. P., & Wang, J. J. (2007). Predicting the number of medals at the 2008 Olympics using time series analysis. Journal of Nanjing Institute of Physical Education (Natural Science Edition), (01), 31-32.
[4] Wang, G. F., Xue, E. J., & Tang, X. F. (2010). Research on medal prediction in large-scale international comprehensive sports events: A case study of the Beijing Olympics. Journal of Tianjin University of Sport, 25(01), 86-90. https://doi.org/10.13297/j.cnki.issn1005-0000.2010.01.007
[5] Wang, G. F., & Tang, X. F. (2009). Domestic and international research trends and development directions in Olympic medal prediction. China Sport Science and Technology, 45(06), 3-7+135. https://doi.org/10.16470/j.csst.2009.06.016