
Problem Chosen
C


2025

MCM/ICM

Summary Sheet

Team Control Number
2513615

Your Paper's Title

Summary


The Olympic Games are a major event followed closely around the world, and each country's position in the medal table attracts wide attention.


Keywords: keyword1; keyword2; keyword3; keyword4

Team # 2513615 Page 2 of 11

Contents

1 Introduction

1.1 Problem Background

1.2 Our Work

2 Assumptions and Justifications

3 Notations

4 Predicting Medal Numbers for the 2028 Olympic Games

4.1 2028 Los Angeles Olympics medal table

4.1.1 Model selection

4.1.2 Data Processing

4.1.3 Model Establishment

4.1.4 Model Calculation

4.1.5 Model Evaluation

4.2 The Relationship between Project Settings and Medal Counts

4.2.1 Data Processing

4.2.2 Build Feature Engineering

4.2.3 Model Construction, Training, and Solution

4.2.4 Model Evaluation

4.3 The relationship between project settings and medal count

4.3.1 Model Establishment

4.3.2 Introducing Feature Variables

4.3.3 Model Solution and Answer

5 The Impact of Great Coach Effect on the Number of Medals

5.1 Establishment of the Model

5.1.1 Dual Indicator Method

5.1.2 Multivariate Regression Model

5.2 Model Solution

6 The Impact of Athlete Age on the Number of Medals

6.1 Data Processing

6.2 Model Establishment

6.3 Model Solution

6.4 Model Evaluation

7 Sensitivity Analysis and Error Analysis

7 Model Evaluation and Further Discussion

7.1 Strengths

7.2 Weaknesses

7.3 Further Discussion

8 Conclusion

References

Introduction

Problem Background


During the 2024 Paris Olympic Games, the audience closely followed individual events and the medal table rankings of various countries. Countries at the top of the medal standings always receive more attention, such as the United States, which leads with 126 medals. At the same time, other medal-winning countries are also worthy of our respect, such as Albania and Cape Verde, which have achieved historic breakthroughs. However, it is noteworthy that more than 60 countries have never won an Olympic medal. The number of Olympic medals is closely related to a country's economic and social development, sports education policies, and population size, among other factors. Predicting future Olympic medal tallies is significant; therefore, we will utilize mathematical models to analyze and address this issue.

Considering the background information and restricted conditions identified in the problem statement, we need to address the following issues:


Problem 1: Based on the model, predict the medal distribution for the 2028 Olympics and provide prediction intervals. Which countries are likely to win more medals, and which might perform worse? What is the probability that countries that have never won a medal will win their first one at the 2028 Olympics? Analyze the number and type of Olympic events and their relationship with the number of medals. Which sports are closely tied to certain countries, and how does the host's choice of events affect the distribution of medals?


Problem 2: How large is the impact of the "great coach" effect on the number of medals? Select three countries, identify the sports in which they should invest in coaching, and estimate how their medal counts would change under the influence of such coaches.


Problem 3: What other factors might affect the distribution of the number of medals?


Our Work


Figure 1: Our Work

Assumptions and Justifications

Assumption 1: Athletes will not change their nationality. 

Explanation: If an athlete has the potential to win a medal and they change their nationality, it would result in a decrease in the medal count for one country and an increase for another country.


Assumption 2: Historical data patterns are stable.


Explanation: The patterns in the historical data remain stable in the future; that is, the factors that influenced a country's first medal win in the past will still apply at the 2028 Olympics.


Assumption 3: High medal efficiency coupled with a high event participation rate reflects an elite coach strategy plus broad participation. High medal efficiency but a low event participation rate suggests a possible reliance on an elite strategy dominated by a few top coaches. Low medal efficiency indicates a lack of coaching resources or insufficient project investment.


Explanation: This approach simplifies the problem by directly linking the impact of coaches to the number of medals.

Notations

The key mathematical notations used in this paper are listed in Table 1.

Table 1: Notations used in this paper

| Symbol | Description |
| $I(\cdot)$ | Indicator function |
| $Y$ | Year |
| card | The cardinality of a set |
| $A_{NOC,Y}$ | Number of athletes of country NOC in year $Y$ |
| $F_0(x)$ | Initial prediction model output (starting value) |
| $y_i$ | Target variable: the total number of medals of the $i$-th country |
| $c$ | Constant prediction value of the current model |
| $L(y_i, c)$ | Loss function, measuring the deviation of the model's predicted value from the true value |
| $N$ | Sample size, that is, the total number of countries participating in the Olympics |
| $F(x_i)$ | Predicted value for the $i$-th country, the one the model is currently processing |
| $\partial L / \partial F(x_i)$ | Partial derivative of the loss function with respect to the model prediction value |
| $r_{im}$ | Model's error in predicting the number of medals for a certain country |
| $h_m(x)$ | The $m$-th tree: the regression tree fitted to the residuals of the current model |
| $F_{m-1}(x)$ | Predicted value of the current model (including the cumulative contribution of all previous trees) |
| $M$ | Total number of iterations, i.e., the total number of decision trees |
| $\epsilon$ | Small constant that prevents the denominator from being zero |
| Medals | Total medals |
| Athletes Count | Number of participating athletes |
| Event Count | Number of events |
| Event Participation Rate | Competition participation rate |
| $\varepsilon$ | Error term |
| $\beta_j$ | Regression coefficient, indicating the marginal contribution of the feature to the number of medals |

Predicting Medal Counts for the 2028 Olympic Games

2028 Los Angeles Olympics medal table


Model selection


Predicting Olympic medal counts typically involves multiple interacting variables whose relationships are often nonlinear. Random Forest captures complex nonlinear relationships by constructing multiple decision trees and combining their predictions, so it can model these interactions well. Factors such as a country's GDP, population, and distance from the host country need to be considered, and Random Forest can handle datasets with a large number of features, whether continuous, categorical, or mixed. The model also reduces the risk of overfitting by introducing randomness into tree building (for example, by randomly selecting subsets of features). Finally, Random Forest provides an assessment of feature importance, which shows which factors matter most for predicting medal counts and allows a deeper analysis of what drives medal acquisition. We therefore adopt the Random Forest model to predict the medal counts for the 2028 Olympics.


Data processing


First, filter and clean the data: check for errors, outliers, and duplicate records, and correct or delete them as needed. For time series with missing values, mean filling or linear interpolation can be used to keep the data continuous; following the data dictionary, the remaining missing values are filled directly with 0. Apparent outliers are retained, because some countries genuinely win far more medals than others, so outlier removal can be omitted. The stationarity of each time series can be tested with unit root tests (such as the ADF test) and the KPSS test; if a series is not stationary, it may need to be differenced or transformed to avoid spurious regression. Next, select the core data from the cleaned dataset, using the year and athlete information as feature variables and the gold, silver, bronze, and total medal counts as target variables. Finally, preprocess the feature variables: numerical features (such as the year) are standardized and categorical features (such as the country) are one-hot encoded.
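The two preprocessing steps named above, standardization and one-hot encoding, can be illustrated with a minimal standard-library sketch. The toy rows and the helper names `standardize` and `one_hot` are assumptions made for illustration and are not taken from the paper's code.

```python
from statistics import mean, pstdev

def standardize(values):
    """Scale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def one_hot(labels):
    """Return (sorted categories, one 0/1 row per label)."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        rows.append(row)
    return categories, rows

# Hypothetical toy rows: a numerical feature (year) and a categorical one (country).
years = [2016, 2020, 2024]
countries = ["USA", "CHN", "USA"]

scaled_years = standardize(years)
categories, encoded = one_hot(countries)
```

After standardization the year column has mean zero, and each country becomes a 0/1 vector with a single 1 in the position of its category.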


Model establishment


Using the above data, we build a random forest regression model with 100 trees and a fixed random seed to ensure reproducibility. Data preprocessing and the model are combined into a pipeline, and the features and target variables are split into training and test sets, with the test set accounting for 20%. Finally, the pipeline is fitted on the training set, which completes the random forest model.


Model calculation


After the model is established, the prediction calculation begins. First, assemble the data from 1896 to 2024 for all countries, and then use the trained pipeline to predict the gold, silver, bronze, and total medal counts of each country for the coming Games. Table 2 lists, for some countries, the predicted number of medals in 2028 alongside the number of medals in 2024. Compared with the 2024 Olympic Games, countries such as Serbia are expected to gain about 2 medals in 2028, the United States may lose 6 medals, Japan may lose 4 medals, Australia and other countries may lose 2 to 3 medals, while South Korea, Spain, and other countries are expected to remain unchanged. Figure 3 is the corresponding visual comparison of the leading countries' gold, silver, bronze, and total medal counts in 2024 and 2028.


Table 2: Comparison of 2024 medal counts and 2028 predictions for some countries


| Country | 2024 Gold | 2028 Gold | 2024 Silver | 2028 Silver | 2024 Bronze | 2028 Bronze | 2024 Total | 2028 Total |
| United States | 40 | 46 | 44 | 38 | 42 | 37 | 126 | 124 |
| China | 40 | 36 | 27 | 29 | 24 | 28 | 91 | 94 |
| Japan | 20 | 12 | 12 | 11 | 13 | 18 | 45 | 42 |
| Australia | 18 | 17 | 19 | 15 | 16 | 18 | 53 | 51 |
| France | 16 | 25 | 26 | 17 | 22 | 21 | 64 | 64 |
| Netherlands | 15 | 11 | 7 | 10 | 12 | 11 | 34 | 33 |
| Great Britain | 14 | 26 | 22 | 17 | 29 | 22 | 65 | 64 |


Figure 3: Comparison of medal counts for the top five countries in 2024 and 2028


The chart shows that the United States ranks first in the predicted gold, silver, and bronze medal counts, demonstrating its comprehensive advantage at the Olympics. China ranks second in all three, following the United States, which shows its strength as a sports powerhouse. Great Britain, France, and Australia are stable in the medal predictions and rank in the top five. Japan, Italy, the Netherlands, Germany, and South Korea are relatively close to one another in the predictions, reflecting how competitive these countries are. In terms of distribution, gold medals are concentrated in a few countries (such as the United States and China), while silver and bronze medals are distributed more evenly.


Figure 2: 2028 Olympic Gold, Silver, and Bronze Medal Top Ten Countries Forecast


The relationship between project settings and medal counts.


4.2.1 Data processing


The data are the athlete participation records of the Summer Olympic Games from 1896 to 2024, covering the athlete name (Name), gender (Sex), team (Team), country code (NOC), participation year (Year), host city (City), sport (Sport), specific event (Event), and medal (Medal) fields. We clean the medal data by converting the Medal field into a binary variable: a gold, silver, or bronze medal is assigned the value 1, and all other cases (including missing values and no medal) are assigned 0. At the same time, records with missing values or format errors are deleted to ensure the accuracy and integrity of the data. The cleaned data are then grouped by country code (NOC) and year (Year). For each group we count the number of unique athletes (Athletes, the number of unique values in the Name field) and the number of sports entered (Sports, the number of unique values in the Sport field), and record whether a medal was won (Medal, the maximum of the Medal field in the group). For example, AFG in 1936 had 15 athletes in 1 sport and won no medal (Medal = 0); in 1948 it had 22 athletes in 2 sports and again won no medal.
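The grouping step described above can be sketched in plain Python as follows. The records are hypothetical toy rows that only reuse the field names from the text (Name, NOC, Year, Sport, Medal); the helper names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical athlete-level records using the field names from the text.
records = [
    {"Name": "A", "NOC": "AFG", "Year": 1936, "Sport": "Hockey", "Medal": None},
    {"Name": "B", "NOC": "AFG", "Year": 1936, "Sport": "Hockey", "Medal": None},
    {"Name": "C", "NOC": "USA", "Year": 1936, "Sport": "Athletics", "Medal": "Gold"},
    {"Name": "C", "NOC": "USA", "Year": 1936, "Sport": "Athletics", "Medal": None},
    {"Name": "D", "NOC": "USA", "Year": 1936, "Sport": "Swimming", "Medal": "Bronze"},
]

def to_binary(medal):
    """1 for gold/silver/bronze, 0 for anything else (including missing)."""
    return 1 if medal in {"Gold", "Silver", "Bronze"} else 0

def summarize(rows):
    """Group by (NOC, Year): unique athletes, unique sports, any medal won."""
    groups = defaultdict(lambda: {"names": set(), "sports": set(), "medal": 0})
    for row in rows:
        group = groups[(row["NOC"], row["Year"])]
        group["names"].add(row["Name"])
        group["sports"].add(row["Sport"])
        group["medal"] = max(group["medal"], to_binary(row["Medal"]))
    return {
        key: {
            "Athletes": len(group["names"]),
            "Sports": len(group["sports"]),
            "Medal": group["medal"],
        }
        for key, group in groups.items()
    }

summary = summarize(records)
```

Taking the maximum of the binary Medal values per group is what turns athlete-level rows into a country-level "won at least one medal" flag.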


Model evaluation


We estimate the uncertainty of the Olympic medal predictions with a quantile regression random forest. Using a model of 500 decision trees, we determine the prediction interval from the quantiles (5th, 25th, 50th, 75th, 95th) of the individual trees' predictions, which quantifies uncertainty in three ways: (1) Median prediction: the point estimate, reflecting the most likely medal count. (2) Interquartile range (IQR): the spread of the middle 50% of the predicted values. (3) Prediction interval (PI): the range from the 5th to the 95th percentile, forming a 90% prediction interval.
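As a rough illustration of how the three quantities above follow from a set of per-tree predictions, here is a minimal sketch. The per-tree values are hypothetical, and the linear-interpolation percentile is one common convention, assumed here because the text does not say which one was used.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of an ascending list, q in [0, 100]."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac

def summarize_trees(tree_predictions):
    """Median, IQR, and 5th-95th percentile interval of per-tree predictions."""
    values = sorted(tree_predictions)
    q05, q25, q50, q75, q95 = (percentile(values, q) for q in (5, 25, 50, 75, 95))
    return {
        "median": q50,
        "iqr": q75 - q25,
        "interval": (q05, q95),
        "width": q95 - q05,
    }

# Hypothetical per-tree predictions for one country (the paper uses 500 trees).
tree_predictions = [float(v) for v in range(10, 31)]  # 10, 11, ..., 30
summary = summarize_trees(tree_predictions)
```

The interval width reported per medal type corresponds to `width` here, and the median serves as the point forecast.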


1) Model evaluation


As shown in Table 3, the model demonstrates good predictive ability on the test set: (1) Prediction accuracy: the mean squared error (MSE) is 23.45 and the mean absolute error (MAE) is 3.78 medals. (2) Explanatory power: the coefficient of determination R² reaches 0.92, indicating that the model explains 92% of the variation in the number of medals. (3) Interval coverage: the overall prediction interval coverage rate is 91.23%, close to the nominal value of 90%.


Table 3: Performance Metrics for Medal Type Predictions at the 2028 Olympic Games

| Medal Type | MSE | MAE | R² | Prediction Interval Width | Coverage Rate (%) |
| Gold | 23.45 | 3.78 | 0.92 | 13.9 | 89.47 |
| Silver | 18.32 | 3.12 | 0.89 | 11.7 | 92.31 |
| Bronze | 15.67 | 2.95 | 0.91 | 12.3 | 91.89 |
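For reference, the four metrics reported in this evaluation (MSE, MAE, R², and interval coverage) can be computed as in the sketch below. The numbers are hypothetical toy values, not the paper's test set.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def coverage(y_true, lower, upper):
    """Share of true values that fall inside their prediction interval."""
    hits = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi)
    return hits / len(y_true)

# Hypothetical medal counts, point predictions, and interval bounds.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 39]
lower = [8, 15, 31, 30]
upper = [14, 25, 38, 45]

metrics = {
    "MSE": mse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "R2": r2(y_true, y_pred),
    "coverage": coverage(y_true, lower, upper),
}
```

A coverage rate close to the nominal 90% indicates that the 5th to 95th percentile intervals are neither too narrow nor too wide.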


2) Predictive Interval Feature Analysis


Prediction interval characteristics for a typical country (taking China as an example):


Gold medal prediction: the widest interval (13.9 medals), with a median forecast of 22.3 medals (IQR = 7.3) and a 90% prediction interval of [15.2, 29.1], reflecting the strong uncertainty of top-level competitive sports.


Silver medal prediction: the highest interval coverage rate (92.31%), with a median forecast of 18.9 medals and a prediction error (MAE = 3.12) significantly lower than the historical fluctuation level.


Bronze medal prediction: the IQR is 6.3, with a right-skewed distribution (75th to 95th percentile range of 9.4 versus a 5th to 25th percentile range of 2.6).


3)Discussion on Sources of Uncertainty




4.2.2 Build feature engineering


Using the target year $T$ as the prediction point, extract historical data up to year $T - 4$ (for example, 2024 when $T = 2028$), and generate the following features:


1. Number of participations (Participations): the total number of times the country has participated up to the cutoff year, calculated as

$$\text{Participations} = \sum_{Y \le T-4} I(\text{NOC competes in year } Y)$$

where $I(\cdot)$ is the indicator function: it equals 1 if NOC competes in year $Y$, and 0 otherwise.


2. Total number of athletes (AthletesTotal): the total number of athletes from the country over all participating years up to the cutoff year, calculated as

$$\text{AthletesTotal} = \sum_{Y \le T-4} A_{NOC,Y}$$

where $A_{NOC,Y}$ is the number of athletes representing the country NOC in year $Y$.


3. Sport diversity (SportsDiversity): the number of different sports the country has participated in up to the cutoff year, calculated as the cardinality of the union of the sets of sports entered in the different years, that is,

$$\text{SportsDiversity} = \operatorname{card}\Big(\bigcup_{Y \le T-4} S_{NOC,Y}\Big)$$

where $S_{NOC,Y}$ is the set of sports NOC entered in year $Y$, and "card" is the cardinality of a set, that is, the number of elements in the set.


4. Number of recent athletes (RecentAthletes): the number of athletes from the country in the cutoff year ($T-4$), namely

$$\text{RecentAthletes} = A_{NOC,T-4}$$
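The four features can be computed from per-Games country summaries roughly as sketched below. The `history` layout, the country code `XXX`, and its values are hypothetical and chosen only to exercise each formula.

```python
def build_features(history, noc, target_year):
    """history maps (NOC, year) -> {"athletes": int, "sports": set of names}."""
    cutoff = target_year - 4
    rows = {
        year: rec
        for (code, year), rec in history.items()
        if code == noc and year <= cutoff
    }
    all_sports = set()
    for rec in rows.values():
        all_sports |= rec["sports"]  # union of sports over all Games entered
    return {
        "Participations": len(rows),
        "AthletesTotal": sum(rec["athletes"] for rec in rows.values()),
        "SportsDiversity": len(all_sports),
        "RecentAthletes": rows[cutoff]["athletes"] if cutoff in rows else 0,
    }

# Hypothetical history for an illustrative country code "XXX".
history = {
    ("XXX", 1936): {"athletes": 15, "sports": {"Hockey"}},
    ("XXX", 1948): {"athletes": 22, "sports": {"Hockey", "Football"}},
    ("XXX", 2024): {"athletes": 6, "sports": {"Athletics", "Judo"}},
}
features = build_features(history, "XXX", 2028)
```

Restricting every feature to years up to `target_year - 4` keeps information from the target Games out of the inputs.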


4.2.3 Model Construction, Training, and Solution


First, the data are split and balanced. The training set contains only data from countries that have never won a medal (Medal History = 0). The ADASYN (Adaptive Synthetic Sampling) algorithm is used to over-sample the training set: it synthesizes additional minority-class samples (countries winning for the first time), generating more of them where the minority class is harder to learn, which alleviates the class imbalance and lets the model learn the features of the minority class better. A random forest classifier is then selected as the prediction model, with the following parameters: the number of decision trees is 200, to balance computational efficiency and model complexity; the class weights are 1 for the negative class and 5 for the positive class, so that missing a first-time winner is penalized more heavily; the maximum tree depth is 5 and the minimum number of samples per leaf is 10, which limits overfitting and improves generalization to different datasets. The decision threshold is then tuned on the precision-recall curve of the test set: the optimal threshold satisfying recall ≥ 15% and precision ≥ 25% is selected, and if no threshold meets both conditions, the 85th percentile of the predicted probabilities is used instead. This dynamic threshold gives a better balance when predicting first medals for countries that have never won one.

Blacklisted countries (such as SSD and LBN) are then excluded from the predefined list of countries that have never won a medal (NON_MEDAL_COUNTRIES), and business rules are applied to select the candidate countries: number of participations (Participations) ≥ 1, recent number of athletes (RecentAthletes) ≥ 6, and sport diversity (SportsDiversity) ≥ 2.
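The threshold rule can be sketched as follows. The text does not say how the "optimal" threshold is chosen among those meeting both constraints, so this sketch assumes the highest F1 score as the tie-breaker; the labels and scores are hypothetical.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if p and y == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if p and y == 0)
    fn = sum(1 for y, p in zip(y_true, predicted) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac

def choose_threshold(y_true, scores, min_recall=0.15, min_precision=0.25):
    """Best-F1 threshold meeting both constraints (F1 is an assumption here);
    otherwise fall back to the 85th percentile of the predicted scores."""
    best, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        p, r = precision_recall(y_true, scores, t)
        if r >= min_recall and p >= min_precision:
            f1 = 2 * p * r / (p + r)
            if f1 > best_f1:
                best, best_f1 = t, f1
    if best is None:
        return percentile(scores, 85)
    return best

# Hypothetical labels (1 = first medal won) and predicted probabilities.
y_true = [0, 0, 0, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
threshold = choose_threshold(y_true, scores)
fallback = choose_threshold([0, 0, 0], [0.1, 0.2, 0.3])
```

The fallback case arises when no positive example can be recovered at acceptable precision, for instance when the evaluation split contains no first-time winners.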


Extract the features defined in the feature engineering stage from the filtered candidate countries, input these features into the trained random forest model, and the model outputs the probability of each candidate country winning the first time.
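The candidate screening (blacklist plus the three business rules stated above) can be written as a small filter. The country codes `AAA`, `BBB`, and `CCC` and their feature values are hypothetical; only SSD and LBN come from the text.

```python
BLACKLIST = {"SSD", "LBN"}  # the two example codes named in the text

def select_candidates(non_medal_countries, features):
    """Keep countries that pass the blacklist and the three business rules."""
    selected = []
    for noc in non_medal_countries:
        if noc in BLACKLIST:
            continue
        f = features.get(noc)
        if f is None:
            continue  # no feature row available for this country
        if (
            f["Participations"] >= 1
            and f["RecentAthletes"] >= 6
            and f["SportsDiversity"] >= 2
        ):
            selected.append(noc)
    return selected

# Hypothetical feature rows for illustrative country codes.
features = {
    "AAA": {"Participations": 5, "RecentAthletes": 8, "SportsDiversity": 3},
    "BBB": {"Participations": 9, "RecentAthletes": 4, "SportsDiversity": 5},
    "CCC": {"Participations": 2, "RecentAthletes": 7, "SportsDiversity": 1},
    "SSD": {"Participations": 3, "RecentAthletes": 9, "SportsDiversity": 4},
}
candidates = select_candidates(["AAA", "BBB", "CCC", "SSD"], features)
```

Only the surviving candidates are passed to the trained classifier for probability estimation.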


Through the construction and training of the model, we predict that the following countries have a probability above 70% of winning their first medal at the 2028 Olympics: P(ANG) = 0.856079, P(NFL) = 0.855923, P(UNK) = 0.750016, P(PLE) = 0.709846, P(TUV) = 0.709559, P(GBS) = 0.706677.


Figure 4: First-Medal Probability for Countries That Have Never Won a Medal


4.2.4 Model Evaluation


The model's AUC-ROC is 0.89, indicating strong discriminative ability: it distinguishes well between countries without a medal that will win for the first time and those that will not. At the optimal threshold, the precision is 32% and the recall is 28%, reflecting the accuracy and completeness of the predicted first-time winners. Moreover, under cross-validation the model's performance is stable, with a standard deviation below 0.03, indicating good robustness across different data subsets.


The relationship between project settings and the number of medals


Establishment of the model


We use the HistGradientBoostingRegressor model to quantify the relationship between the events on the program and the medal distribution. Gradient boosting is an ensemble learning method based on decision trees, whose core idea is to improve the predictive performance of the model by successively fitting the errors of the current model.


After iterating $M$ times, the final model is:

$$F_M(x) = F_0(x) + \sum_{m=1}^{M} h_m(x)$$

where $F_M(x)$ is the final prediction model, including the cumulative contribution of all $M$ trees; $F_0(x)$ is the initial prediction; $h_m(x)$ is the regression tree fitted at iteration $m$; and $M$ is the total number of iterations, i.e., the total number of decision trees.

The final model is the sum of the contributions of all weak predictors (trees) and can capture complex relationships between the features and the target variable (number of medals). For example, it can simultaneously account for the host country effect (whether the country hosts) and the number and type of sports events (such as track and field, swimming, etc.), and their impact on the distribution of medals.
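To make the additive structure concrete, the sketch below boosts one-split regression stumps under squared loss on a single feature. It is a toy illustration of the principle, not the histogram-based implementation used in the paper; the learning rate, the stump learner, and the data (a host flag against medal counts) are all assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (least squares)."""
    best = None
    candidates = sorted(set(xs))
    for left_value, right_value in zip(candidates, candidates[1:]):
        split = (left_value + right_value) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    if best is None:  # all x identical: fall back to a constant prediction
        mean_r = sum(residuals) / len(residuals)
        return lambda x: mean_r
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean

def gradient_boost(xs, ys, n_trees=50, learning_rate=0.5):
    """Boosting for squared loss: each stump fits the current residuals."""
    f0 = sum(ys) / len(ys)  # F_0: the best constant under squared loss
    trees = []
    predictions = [f0] * len(xs)
    for _ in range(n_trees):
        # For squared loss the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]

    def model(x):
        return f0 + learning_rate * sum(tree(x) for tree in trees)

    return model

# Hypothetical toy data: host flag (0/1) against medal count.
xs = [0, 0, 0, 1, 1]
ys = [10, 12, 11, 30, 32]
model = gradient_boost(xs, ys)
```

With a learning rate below 1, each tree contributes only part of its fitted residual, so the prediction approaches the group means gradually as trees accumulate.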


Introduce feature variables


Is Host: the host country indicator, used to measure the home advantage.

Host Chosen Sports: the number of sports chosen by the host country, reflecting the host country's ability to influence the number of medals through the addition of new events.

Medal Efficiency: the ratio of the number of medals to the number of events, used to measure the training and selection efficiency of a country.

Events: the total number of events, used to describe the overall distribution of medal opportunities.

Individual event features (such as athletics, swimming, etc.): measure the contribution of a country's performance in a specific sport to the total number of medals.

Total: the total number of medals won by each country at a particular Olympic Games.


Based on the optimized results, the feature importances are organized into a table and sorted by importance.


Table 4: Feature Names and Their Importance

| Feature name | Feature importance |
| Is Host | 0.467212 |
| Host Chosen Sports | 0.267642 |
| Medal Efficiency | 0.185143 |
| Events | 0.045841 |
| Athletics | 0.016510 |
| Swimming | 0.014768 |
| Gymnastics | 0.001607 |
| Basketball | 0.000673 |
| Football | 0.000604 |


Model solution and answer


According to the data in Table 4, the ranking of feature importance is:


Is Host (host country): the highest feature importance, about 46.7%. The host country effect is the most critical factor affecting the total number of medals, which may be due to two things. Home advantage: athletes are familiar with the venues and supported by the spectators, so the host country has a significant advantage in its strong events. Judging bias: in subjectively scored events (such as gymnastics and diving), the host country's athletes often have an advantage.


Host Chosen Sports (number of sports chosen by the host country): the second most important feature, accounting for 26.8%. The host country can increase its medal count by adding or promoting sports events. Typically, the host tends to choose events in which it excels, which significantly boosts its medal tally.


Medal Efficiency: ranked third, with a feature importance of 18.5%. This feature reflects how efficiently a country's athletes convert participation into medals (the ratio of the medal count to the number of events entered). Countries with higher medal efficiency usually concentrate more resources in a few advantageous events.


Events (number of events): the feature importance is only 4.6%. The total number of events has some impact on the total number of medals, but the effect is relatively indirect and may need to be combined with other features (such as the host country or the number of athletes) to play a significant role.


Individual sports (such as track and field, swimming): the importance of each individual sport is relatively low (less than 2%), which may be because these sports contribute evenly to the total number of medals.


Regarding "the impact of sports events on medal distribution": from the perspective of individual sports, although their importance is relatively low, sports with high medal density such as swimming and track and field (about 35 and 48 events respectively) still make a significant basic contribution to the total number of medals. In track and field, almost all countries invest resources because it offers the most medals. In swimming, medals are distributed relatively widely, but some developed countries (such as the United States and Australia) have significant advantages. In gymnastics, although the number of medals is small, subjective scoring by judges is common, and the host country's athletes often have an advantage. From the perspective of the number of events, the more events there are, the greater the opportunity for countries to win medals. The host country often promotes the addition of new events, and this strategy has a significant impact on the distribution of medals. For example, in the 2000 Sydney Olympics, women's weightlifting was added, and Australia performed outstandingly in this event.


Regarding "which sports are most important for various countries," we conducted a focused analysis of China, the United States, and Japan, and drew the following conclusions. The United States has long dominated high-medal-density sports such as track and field and swimming, thanks to its strong sports infrastructure and extensive training system. China has a high gold medal efficiency in a few sports such as diving, table tennis, and gymnastics; resources are concentrated in these sports, and the athletes have maintained a leading technical level for a long time. Japan has a historical advantage in traditional sports such as judo and increased its total medal count at its home Olympics in 2020 by adding the sport of karate.


Regarding how the host country's choice of events affects the results, we found that the host country raises its medal share by adding events in which it is strong. For example, at the 2020 Tokyo Olympics Japan performed strongly in newly added sports such as sport climbing as well as in its traditional strength, judo, significantly increasing its medal count with the advantage of home ground. The host country also has a natural advantage in events scored by judges (such as gymnastics and diving), and adding such events contributes greatly to the host country's medal count.


The Impact of the Great Coach Effect on the Number of Medals


Establishment of the model


Dual indicator method


We first adopt the dual indicator method, analyzing medal efficiency (Medal Efficiency, ME) and event participation rate (Event Participation Rate, EPR). We assume that high medal efficiency combined with a high event participation rate reflects an elite coach strategy plus broad participation; high medal efficiency with a low event participation rate suggests reliance on an elite strategy dominated by a few top coaches; and low medal efficiency indicates a lack of coaching resources or insufficient investment in the sport.


1. Medal efficiency calculation formula:

$$ME = \frac{\text{Medals}}{\text{Athletes Count} + \epsilon}$$

where Medals is the number of medals won by a country in a certain sport, Athletes Count is the number of athletes participating, and $\epsilon$ is a small constant that prevents the denominator from being zero.

2.Calculation formula for event participation rate:

()


where Event Count is the total number of events in the corresponding sport, and a small constant again prevents the denominator from being zero.
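The dual indicator reading can be sketched as a small classifier. Medal efficiency is computed as medals over athletes with a small constant in the denominator; the event participation rate is taken as an input because its formula is not reproduced here. The value of `EPSILON` and both cut-off thresholds are hypothetical choices for illustration only.

```python
EPSILON = 1e-6  # hypothetical small constant; the paper's value is not shown

def medal_efficiency(medals, athletes_count, eps=EPSILON):
    """ME = Medals / (Athletes Count + eps); eps avoids division by zero."""
    return medals / (athletes_count + eps)

def classify(me, epr, me_threshold=0.5, epr_threshold=0.4):
    """Map (ME, EPR) to the three cases of the dual indicator assumption.
    Both thresholds are illustrative assumptions, not values from the paper."""
    if me < me_threshold:
        return "lack of coaching resources or insufficient investment"
    if epr >= epr_threshold:
        return "elite coach strategy plus broad participation"
    return "elite strategy dominated by a few top coaches"

# Hypothetical inputs: 12 medals for 12 athletes, with two participation rates.
me_high = medal_efficiency(12, 12)
broad = classify(me_high, 0.5174)
narrow = classify(me_high, 0.2240)
weak = classify(medal_efficiency(1, 40), 0.90)
```

The classification only orders sports and countries into the three qualitative cases; the regression model is what attaches a marginal effect to them.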


Multivariate regression model


Establish a regression model to quantify the marginal contribution of the coaching effect. The multivariate linear regression model is as follows:

$$\text{Medals} = \beta_0 + \beta_1 \cdot \text{Athletes Count} + \beta_2 \cdot \text{Event Count} + \beta_3 \cdot \text{Event Participation Rate} + \varepsilon$$

where Medals is the total number of medals, Athletes Count is the number of participating athletes, Event Count is the number of events, and Event Participation Rate is the competition participation rate. $\beta_0$ is the intercept, $\beta_1, \beta_2, \beta_3$ are the regression coefficients, indicating the marginal contribution of each feature to the number of medals, and $\varepsilon$ is the residual term. From the coefficient $\beta_3$ we can judge the impact of the event participation rate on the number of medals. Marginal contribution: when the event participation rate increases by one unit, the number of medals changes by $\beta_3$.
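A least-squares fit of a model of this form can be sketched with the normal equations, as below. The data are synthetic, generated from known coefficients so that the fit can be checked; they are not the paper's data and will not reproduce its reported coefficients.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(features, targets):
    """Least-squares coefficients [b0, b1, ...] with an intercept column."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic rows: (athletes count, event count, event participation rate).
features = [
    (10, 2, 0.2), (20, 3, 0.5), (30, 5, 0.3),
    (40, 4, 0.6), (50, 8, 0.4), (15, 6, 0.7),
]
true_coefs = [1.0, 0.2, 0.5, -1.0]  # hypothetical b0..b3 used to generate targets
targets = [
    true_coefs[0] + sum(b * v for b, v in zip(true_coefs[1:], f))
    for f in features
]
coefs = fit_ols(features, targets)
```

Each fitted coefficient is read as the change in medals per unit change of its feature with the other features held fixed, which is the marginal contribution discussed above.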


Model solution


We take the United States, Japan, and Cuba as the research objects. From the dual indicator method and the multiple linear regression model, the coefficient of Athletes Count is 0.1571, meaning that each additional participating athlete increases the number of medals by about 0.157 on average. The coefficient of Event Participation Rate is -0.0009, meaning that each 1% increase in the event participation rate slightly decreases the number of medals. This suggests that highly elite sports may be better suited to relying on excellent coaches, a pattern that is particularly evident in high-efficiency sports. Using data from sports and countries with high medal efficiency, the contribution of the coach effect can be estimated indirectly:


(1) USA basketball: a medal efficiency of 0.999999434 indicates nearly perfect results. The high event participation rate of 51.74% indicates a combination of broad participation and elite coaching. If the event participation rate increases by 10% and efficiency remains unchanged, the regression coefficient estimate implies that the number of medals will increase by approximately 0.81.


(2) Japanese baseball and softball: the medal efficiency is close to perfect, but the event participation rate is slightly lower than that of the United States. If more top coaches are introduced, Japan can maintain its high medal efficiency.


(3) Cuban baseball: the event participation rate is only 22.40%. If it is raised to 30% by adding coaching resources, the additional medals can be estimated from the coefficient.


Combining the data from High Efficiency Projects and the results of the regression model, here are our investment recommendations for three countries:


(1) United States (USA): Enhanced Coaching Support for Basketball and Baseball


Reason: The United States has already achieved significant medal efficiency and high participation rate in basketball and baseball. Relying on the training of elite coaches can further consolidate its leading position in these sports.


Recommend: Investing in higher-level training facilities and coach education to strengthen long-term competitiveness.


Estimate the impact: On the basis of the current high efficiency, increase the participation rate of the event by 5% to 10%, and the expected number of medals will increase by 1 to 2.


(2) Japan (JPN): Expand participation in baseball and softball events


Reason: Japan's baseball and softball have top-level strength internationally, but the participation rate in events is relatively low compared to the United States. If more investment is made and more high-level coaches are introduced, their existing advantages can be fully utilized.


Recommendation: Expand the domestic baseball competition system and attract excellent foreign coaches.


Estimated impact: If the participation rate increases by 10%, the regression coefficient implies an increase of about 0.5 to 1 medal.


(3) Cuba (CUB): Expand the elite baseball program


Reason: Cuba's baseball team performs at an extremely high efficiency, but the participation rate in events is low, possibly relying on a small number of top players. Increasing the investment in coaching resources can expand the number of participants and enhance overall competitiveness.


Recommendation: Introduce a foreign "great coach" for the baseball program and strengthen the domestic training system.


Estimated impact: Increasing the participation rate from 22.40% to 30% is expected to add 1 medal.


The effect of athletes' age on the number of medals


Data processing


First, collect relevant data containing age and total number of medals. The dataset contains a certain number of samples, reflecting the total number of medals corresponding to different ages. Check the integrity of the data to ensure there are no missing values. If there are missing values, use appropriate methods to fill them, such as mean filling method, median filling method, etc. Then, identify and handle possible outliers. Observe the distribution of data through methods such as box plots, scatter plots, etc., and correct or remove data points that deviate significantly from the normal range according to specific situations.
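The cleaning steps above can be sketched as follows. This is a minimal illustration, assuming median filling for missing values and the usual box-plot rule (1.5 times the interquartile range) for outliers. The sample numbers and function names are hypothetical and are not the paper's data.

```python
import statistics


def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outlier_mask(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical medal totals; None marks a missing value.
medal_totals = [12, 15, None, 14, 13, 16, 15, 400]
filled = fill_missing_with_median(medal_totals)
outliers = iqr_outlier_mask(filled)
cleaned = [v for v, is_out in zip(filled, outliers) if not is_out]
print(cleaned)
```

Whether a flagged point is corrected or removed remains a case-by-case decision, as the text notes.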


Model establishment and resolution


We first clean and filter the data, then plot it as a stacked bar chart (Figure 5).


Figure 5: Relationship between the number of medals and the age of athletes


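After importing the data, we fit it with a Fourier series model. To balance accuracy against overfitting, we adopt the standard four-term model Fourier4, whose formula is:

$$f(x) = a_0 + \sum_{k=1}^{4}\left[a_k\cos(kwx) + b_k\sin(kwx)\right]$$

where $x$ is the athlete age and $f(x)$ is the predicted total number of medals.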


First, we use the least squares method to estimate the model parameters $a_0$, $a_k$, $b_k$ ($k = 1, \dots, 4$), as well as $w$. The goal of the least squares method is to minimize the sum of squared errors between the actual observed values $y_i$ (the actual total number of medals) and the model predictions $f(x_i)$, that is:

$$SSE = \sum_{i=1}^{n}\left(y_i - f(x_i)\right)^2$$

where $n$ is the number of samples, $x_i$ is the age value of the $i$-th sample, and $y_i$ is the actual total number of medals of the $i$-th sample.

Specific estimation process

A numerical optimization algorithm (such as gradient descent or a trust-region algorithm) is used to find the parameter values that minimize the sum of squared errors $SSE$. Initial parameter values are supplied (set from experience or preliminary analysis) and then adjusted iteratively until $SSE$ reaches a minimum or a convergence condition is satisfied. After obtaining the estimates, we calculate the 95% confidence interval of each parameter, which reflects the uncertainty of the estimate. The confidence interval is usually computed from the covariance matrix of the parameter estimates. For a parameter $\theta$ (here $a_0$, $a_1$, $b_1$, etc.), the 95% confidence interval is

$$\hat{\theta} \pm z_{\alpha/2}\,SE(\hat{\theta}),$$

where $\hat{\theta}$ is the estimated value of the parameter, $z_{\alpha/2}$ is the quantile of the standard normal distribution ($z_{\alpha/2} \approx 1.96$ for a 95% confidence interval), and $SE(\hat{\theta})$ is the standard error of the parameter estimate. The final estimated parameter values and confidence intervals are as follows.

a0 = 157.5 (137.8, 177.2)
a1 = -106 (-138.4, -73.57)
b1 = 261.3 (236.6, 285.9)
a2 = -171.4 (-201.5, -141.3)
b2 = -154.7 (-184, -125.5)
a3 = 150.8 (122.1, 179.6)
b3 = -109.3 (-141.5, -77.08)
a4 = 64.08 (33.14, 95.03)
b4 = 108.9 (78.69, 139.1)
w = 0.07627 (0.07503, 0.07751)
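The estimation and interval calculation described above can be sketched in a simplified form. The sketch below assumes that the frequency w is held fixed, which turns the fit into a linear least-squares problem, whereas the paper estimates w jointly with the other parameters by nonlinear optimization. The data are synthetic, the function names are hypothetical, and the residual variance uses n minus the number of coefficients as degrees of freedom, which is also an assumption.

```python
import numpy as np


def design_matrix(x, w, order=4):
    """Columns: 1, cos(k*w*x), sin(k*w*x) for k = 1..order."""
    cols = [np.ones_like(x)]
    for k in range(1, order + 1):
        cols.append(np.cos(k * w * x))
        cols.append(np.sin(k * w * x))
    return np.column_stack(cols)


def fit_fourier_fixed_w(x, y, w, order=4, z=1.96):
    """Least-squares fit with w fixed; returns estimates and 95% intervals."""
    X = design_matrix(x, w, order)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ theta
    dof = len(y) - X.shape[1]
    sigma2 = residuals @ residuals / dof       # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # covariance of the estimates
    se = np.sqrt(np.diag(cov))                 # standard errors
    return theta, np.column_stack([theta - z * se, theta + z * se])


# Synthetic data (not the paper's dataset): a known curve plus small noise.
rng = np.random.default_rng(0)
ages = np.arange(10, 98, dtype=float)
true_theta = np.array([5.0, 2.0, -1.0, 0.5, 0.3, 0.0, 0.0, 0.0, 0.0])
w_fixed = 0.076
medals = design_matrix(ages, w_fixed) @ true_theta + rng.normal(0, 0.01, ages.size)

theta_hat, ci = fit_fourier_fixed_w(ages, medals, w_fixed)
for name, est, (lo, hi) in zip(
    ["a0", "a1", "b1", "a2", "b2", "a3", "b3", "a4", "b4"], theta_hat, ci
):
    print(f"{name} = {est:.3f} ({lo:.3f}, {hi:.3f})")
```

The coefficient order in the output (a0, a1, b1, ..., a4, b4) follows the column order of the design matrix.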


The fitted Fourier series model yields the curve shown in Figure 6. It can be seen that before age 25 the number of medals increases with age, and after age 25 it decreases with age. Most medalists are in the 15 to 35 age group, among whom 25-year-olds account for the highest proportion of winners. The number of winners decreases significantly after age 40, and after age 50 almost no athletes have won medals.


Figure 6: Fourier Series Fitting Graph
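The shape of the fitted curve can be checked by evaluating the model at a few ages with the reported point estimates. This is a minimal sketch, assuming the common convention in which the a_k coefficients multiply cosine terms and the b_k coefficients multiply sine terms. The function name is illustrative.

```python
import math

# Point estimates as reported in the text.
A0 = 157.5
A = [-106.0, -171.4, 150.8, 64.08]   # a1..a4 (assumed cosine coefficients)
B = [261.3, -154.7, -109.3, 108.9]   # b1..b4 (assumed sine coefficients)
W = 0.07627


def fourier4(age: float) -> float:
    """Evaluate a0 + sum over k = 1..4 of a_k cos(k w x) + b_k sin(k w x)."""
    total = A0
    for k, (a_k, b_k) in enumerate(zip(A, B), start=1):
        total += a_k * math.cos(k * W * age) + b_k * math.sin(k * W * age)
    return total


for age in (15, 25, 35, 50):
    print(age, round(fourier4(age), 1))
```

Under this convention the value at age 25 is far larger than the value at age 50, in line with the peak described in the text.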


Model evaluation


We evaluate the model with the following metrics:


1. Sum of Squared Errors (SSE)

$$SSE = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

It measures the overall error between the model's predicted values $\hat{y}_i$ and the actual observed values $y_i$. The smaller the SSE, the better the model fits the data.


2. Coefficient of determination ($R^2$)

$$R^2 = 1 - \frac{SSE}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2},$$

where $\bar{y}$ is the average of the total number of medals. $R^2$ ranges from 0 to 1; the closer it is to 1, the stronger the model's ability to explain the data and the better the fit.


3. Adjusted R-squared ($\bar{R}^2$): It accounts for the number of parameters in the model to avoid overfitting caused by adding too many parameters. The calculation formula is

$$\bar{R}^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1},$$

where $n$ is the number of samples and $p$ is the number of model parameters.


4. Root Mean Square Error (RMSE)

$$RMSE = \sqrt{\frac{SSE}{n}}$$

It is the square root of the mean of the SSE and intuitively reflects the average size of the error between the predicted values and the actual values.
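The four metrics can be computed together. The sketch below assumes the textbook forms adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1) and RMSE = sqrt(SSE/n); some fitting tools divide by the residual degrees of freedom instead, so reported values may differ slightly. The sample numbers are illustrative only.

```python
import math


def fit_metrics(y_true, y_pred, n_params):
    """Return SSE, R^2, adjusted R^2 and RMSE for a fitted model."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    sst = sum((y - mean_y) ** 2 for y in y_true)   # total sum of squares
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_params - 1)
    rmse = math.sqrt(sse / n)
    return {"SSE": sse, "R2": r2, "adj_R2": adj_r2, "RMSE": rmse}


# Illustrative values, not the paper's data.
metrics = fit_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], n_params=1)
print(metrics)
```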


By calculation, we get:


From these evaluation indicators, $R^2$ and adjusted $R^2$ are both relatively high, indicating that the model explains most of the variation in the total number of medals and fits the data well. The root mean square error (RMSE) shows that the average error between the predicted and actual values stays within a limited range. Overall, this Fourier series model describes the relationship between age and the total number of medals well.

Sensitivity Analysis and Error Analysis

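We first perform a sensitivity analysis on the model for the first question:

1. Medal Efficiency perturbation (output: -1.938). When Medal Efficiency is perturbed (scaled up by 20%), the predicted total number of medals changes significantly, decreasing by about 1.94 medals. This shows that medal efficiency has an important influence on the predicted total. If medal efficiency improves (for example, athletes perform better in competition), a country's medal count may increase substantially. In other words, medal efficiency is a key factor in predicting the total number of medals and is likely closely tied to the quality of athletes' performance and training.

2. Event Participation Rate perturbation (output: 0.0). When Event Participation Rate is perturbed, the predicted total number of medals does not change. This shows that the event participation rate has no noticeable effect on the prediction and contributes little in the current model. Even increasing the number of participants in an event does not directly change the predicted medal count. This may mean that the number of participants is not strongly correlated with the number of medals, and that other factors (such as the overall strength of the athletes and medal efficiency) matter more.

3. Athletes Count perturbation (output: 0.0). Likewise, when Athletes Count is perturbed, the predicted total number of medals does not change, so the number of athletes has little influence on the prediction. Although more athletes intuitively bring more medal opportunities, the number of athletes does not significantly affect the total in the current model. This may be because the model relies mainly on features such as medal efficiency rather than on the number of participants alone.

4. Total Events perturbation (output: 0.00093). When Total Events is perturbed, the predicted total number of medals changes slightly, increasing by about 0.00093 medals. This shows that the number of events has a very small effect on the prediction. Although more events may offer a national team more medal opportunities, the number of events does not significantly affect the medal count in this model. The increase in medals likely depends more on the quality and efficiency of the participating athletes than on the number of events alone.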
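The one-at-a-time perturbation test can be sketched as follows. This is a minimal illustration in which a toy function stands in for the trained model. The feature names, baseline values, and the toy function are all hypothetical; only the 20% scaling comes from the text.

```python
def perturbation_effects(predict, baseline, scale=1.2):
    """Scale one feature at a time and report the change in the prediction."""
    base_prediction = predict(baseline)
    effects = {}
    for name in baseline:
        perturbed = dict(baseline)
        perturbed[name] = baseline[name] * scale
        effects[name] = predict(perturbed) - base_prediction
    return effects


# Hypothetical stand-in for the trained model: a toy function that depends
# only on medal efficiency, so the other features have zero effect.
def toy_predict(features):
    return 10.0 * features["medal_efficiency"]


baseline = {
    "medal_efficiency": 0.5,
    "event_participation_rate": 0.3,
    "athletes_count": 120.0,
    "total_events": 40.0,
}
effects = perturbation_effects(toy_predict, baseline)
for name, delta in effects.items():
    print(f"{name}: {delta:+.3f}")
```

A change of exactly 0.0 for a feature, as in this toy case, means the model's prediction does not depend on that feature near the baseline point.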


Next, we perform a sensitivity analysis on the model for the third question. For each parameter, we select several values (for example, 5) at equal intervals within its 95% confidence interval to represent the variation of that parameter. Each varied value is substituted into the Fourier series model, and the model output is calculated over the given age range (taken as the interval from the minimum to the maximum age in the dataset, for example from 10 to 97). The result is then compared with the model output at the estimated parameter values, and the relative change rate is calculated as:

$$\text{Relative change rate} = \frac{\left|\hat{y}_{\text{var}} - \hat{y}_{\text{est}}\right|}{\left|\hat{y}_{\text{est}}\right|} \times 100\%$$

where $\hat{y}_{\text{var}}$ is the model output when the parameter takes a varied value and $\hat{y}_{\text{est}}$ is the model output when the parameter takes its estimated value.
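The averaging procedure can be sketched as below. This is a minimal illustration on a toy one-parameter model rather than the fitted Fourier series. It assumes the relative change is taken in absolute value and averaged over all varied values and all ages, including the interval endpoints. The function and parameter names are hypothetical.

```python
def average_relative_change(model, params, name, interval, xs, n_values=5):
    """Average |y_var - y_est| / |y_est| over equally spaced values of one
    parameter within its interval and over all inputs xs.
    Assumes the baseline output y_est is never zero on xs."""
    low, high = interval
    step = (high - low) / (n_values - 1)
    base = [model(x, params) for x in xs]
    changes = []
    for i in range(n_values):
        varied = dict(params)
        varied[name] = low + i * step
        for x, y_est in zip(xs, base):
            y_var = model(x, varied)
            changes.append(abs(y_var - y_est) / abs(y_est))
    return sum(changes) / len(changes)


# Toy model y = a * x with estimate a = 2 and a hypothetical interval (1, 3).
def toy_model(x, params):
    return params["a"] * x


rate = average_relative_change(
    toy_model, {"a": 2.0}, "a", (1.0, 3.0), xs=list(range(10, 98))
)
print(f"Average relative change rate: {rate:.2%}")
```

For a model whose baseline output passes close to zero at some ages, individual ratios become very large and can dominate the average.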

The average relative change rate of each parameter is as follows:

| Parameter | Average relative change rate |
|---|---|
| a1 | 110.04% |
| b1 | 45.15% |
| a2 | 91.11% |
| b2 | 62.30% |
| a3 | 97.97% |
| b3 | 61.51% |
| a4 | 101.52% |
| b4 | 58.65% |
| w | 80.09% |


From these results, we can infer that the average relative change rates of parameters a1 and a4 are relatively high, indicating that these two parameters have a relatively significant impact on the model output (i.e., the predicted total number of medals) and that the model output is relatively sensitive to changes in them. The average relative change rate of b1 is relatively low, indicating that its impact on the model output is relatively small. In practical applications, improving the accuracy of the model's predictions may require paying more attention to the parameters with a higher average relative change rate and ensuring that their estimates are as accurate as possible.


Model Evaluation

Strengths


(1) For Question 1, we adopted the random forest model, which handles the complex nonlinear relationship between input features and target variables well. In the medal prediction problem, the influencing factors (such as medal efficiency, number of athletes, home effect, etc.) are not in a simple linear relationship with the number of medals, and the random forest can flexibly model these complex relationships.


(2) Random forest can also handle high-dimensional features and is robust to noisy data and outliers. Even when irrelevant features are present, it can still effectively screen out the useful ones.


(3) For Question 3, we adopted the Fourier series model, which fits various nonlinear relationships well and handles complex data patterns. There is a nonlinear relationship between the age of athletes and the number of medals they win, and the Fourier series describes this relationship well.


(4) The Fourier series can also filter out high-frequency noise and focus on the main trends and patterns in the data, which is particularly important for processing Olympic medal data affected by various external factors.


Weaknesses


(1) Regarding Question 1, the random forest model has a weak ability to model time series features and cannot directly capture how the features change over time.


(2) Regarding Question 3, the Fourier series model may overfit noise when fitting the data, which reduces the model's ability to generalize to new data. Appropriate regularization is required to avoid overfitting.

Conclusion


For the first question, we adopted the random forest model and gradient boosting decision tree to predict the number of medals for the 2028 Olympic Games. We found that China, Serbia, and other countries are expected to increase their medal count by about 2 medals, the United States may decrease by 6 medals, Japan may decrease by 4 medals, Australia and other countries may decrease by 2 to 3 medals, while South Korea, Spain, and other countries are expected to maintain the same number of medals. Palestine, Angola, and other countries that have never won Olympic medals have a probability of more than 70% of winning medals at the 2028 Olympics. Regarding the issue of the impact of Olympic events on the number of medals, although the importance of individual events is relatively low, the basic contribution of high-medal density events such as swimming and track and field to the total number of medals is still significant. The United States has long dominated high-medal density events such as track and field and swimming. Japan has a historical advantage in traditional events such as judo. China has a high gold medal efficiency in a few events such as diving, table tennis, and gymnastics.


For the second question, we analyzed the great coach effect using a dual indicator method and a multivariate linear regression model. We recommend that the United States strengthen coaching support for basketball and baseball, that Japan expand participation in baseball and softball events, and that Cuba expand its elite baseball program.


For the third question, we analyzed the impact of athletes' ages on the number of medals using the Fourier series model. The conclusion is: most medal-winning athletes are in the age range of 15 to 35, among whom those around 25 years old account for the largest proportion of winners, and the number of winners over 40 years old decreases significantly.

References


[1] Wu, D. T., & Wu, Y. (2008). The possibility of China surpassing the United States in gold medals at the 2008 Beijing Olympics: Analysis and prediction based on the host effect. Statistical Research, (03), 60-64. https://doi.org/10.19343/j.cnki.11-1302/c.2008.03.012

[2] Wang, F. (2019). Prediction of medal performance at the 2020 Olympics based on neural networks. Statistics and Decision, 35(05), 89-91. https://doi.org/10.13546/j.cnki.tjyjc.2019.05.019

[3] Lin, Y. P., & Wang, J. J. (2007). Predicting the number of medals at the 2008 Olympics using time series analysis. Journal of Nanjing Institute of Physical Education (Natural Science Edition), (01), 31-32.

[4] Wang, G. F., Xue, E. J., & Tang, X. F. (2010). Research on medal prediction in large-scale international comprehensive sports events: A case study of the Beijing Olympics. Journal of Tianjin University of Sport, 25(01), 86-90. https://doi.org/10.13297/j.cnki.issn1005-0000.2010.01.007

[5] Wang, G. F., & Tang, X. F. (2009). Domestic and international research trends and development directions in Olympic medal prediction. China Sport Science and Technology, 45(06), 3-7+135. https://doi.org/10.16470/j.csst.2009.06.016