10.1 Introduction
10.2 Simple Regression Analysis
10.3 Estimating Regression Parameters
10.4 The t-test
10.5 Multiple Regression Analysis
10.6 Explained Variance
10.7 The ANOVA Table and F-test
10.8 Too Many or Too Few Independent Variables
10.9 Regression with Dummy Variables
10.10 Dummy Regression: An Alternative Analysis of Covariance (ANCOVA)
10.11 The Classic Assumptions of Multiple Regression
10.12 Summary
References
10.1 Introduction
Imagine that someone has a lot of data about you: where you work, where you go to school, what you like, what you do not like, who your friends are, and so on. Then, based on this data, they can predict your behaviors and choices with great accuracy: what car you will purchase, where you will go on holiday, what you will buy your mother for her birthday. They can do this because most human behaviors are rational insofar as they follow consistent patterns; we are creatures of habit. And they can predict your behavior because they know how to use regression analysis.
In this chapter, we explain regression analysis. It is the most classic and widely used statistical method for analyzing the dependency relationship between two or more variables. Regression analysis estimates the effect of one or more independent variables $(X)$ on a single dependent variable $(Y)$. We use the following notation:
| $X$ | Variable $X$ (independent variable) |
| :--- | :--- |
| $Y$ | Variable $Y$ (dependent variable) |
| $\beta_{1}$ | Regression parameter |
| $\beta_{0}$ | The constant (regression intercept) |
| $\widehat{Y}$ | The predicted value of $Y$ |
| $\widehat{\beta}$ | The estimate of beta |
| $\varepsilon_{i}$ | The error term (also called the disturbance term) |
| $e_{i}$ | The estimate of the error term (alternative notation: $\widehat{\varepsilon}_{i}$) |
| $R^{2}$ | Explained variance of a regression equation (coefficient of determination) |
| $\bar{R}^{2}$ | Adjusted R-square; also written $R_{adj}^{2}$ |
| $TSS$ | Total sum of squares (total variance) |
| $RSS$ | Residual sum of squares (residual variance) |
| $ESS$ | Explained sum of squares (explained variance) |
| $F$ | Test statistic for the $F$ distribution |
| $H_{0}$ | Null hypothesis |
| $H_{i}$ | Alternative hypothesis, where $i$ is any whole number $(1, 2, 3, \ldots, i)$ |
| $n$ | Sample size |
| $OLS$ | Ordinary least squares estimation |
| $\|t\|$ | Absolute value of the test statistic for the $t$-distribution |
| $Corr(X, Y)$ | Correlation between variable $X$ and variable $Y$ |
| $k$ | $k$ independent variables $(X_{1}, X_{2}, \ldots, X_{k})$ |
| $i$ | The $i$th observation |
Regression analysis is one of many statistical methods used to evaluate the relationship between one or more so-called independent variables $X_{1}, X_{2}, \ldots, X_{k}$ and a dependent variable $Y$. Specifically, regression analysis assesses how changes in the independent variables explain changes in the dependent variable. For example, regression can be used to study:
- Patient satisfaction $(Y)$, and the independent variables physician communication $(X_{1})$ and service level during the hospital stay $(X_{2})$.
- Demand for a product $(Y)$, and the independent variables price $(X_{1})$, advertising $(X_{2})$, and market share $(X_{3})$.
- Change in blood pressure $(Y)$, and hypertension medication $(X_{1})$.
- Crop yield $(Y)$, and rainfall $(X_{1})$, air temperature $(X_{2})$, and soil nitrogen content $(X_{3})$.
While these examples imply a cause-and-effect relationship from the independent variables $(X)$ to the dependent variable $(Y)$, regression itself does not prove causality. It simply tests whether the modeled correlations are significantly different from zero.
Fig. 10.1 Research model for factors affecting hotel revenue
For example, we believe that occupancy rates, price, advertising, competition, and service level determine the amount of revenue at a hotel (see Fig. 10.1). We cannot prove this, but with regression, we can test for significant relationships between the five independent variables and the dependent variable, revenue. If we find significant relationships, we can predict revenue for different values of each independent variable. That is, we can estimate how much a change in occupancy, price, advertising, competition, or service would change revenue. Causality is based on robust argumentation and theory and is supported (or refuted) by evidence.
We expect that occupancy, advertising, and service will have a positive effect on revenue, while price and competition are expected to have a negative effect. This means that if the occupancy rate goes up, advertising is increased, or service is improved, then revenue will increase. Whereas if the price goes up or the number of competitors increases, revenue will decrease. The choice of independent variables should be logical and, if possible, theoretically justified. That is, it should be possible to argue for each variable based on common sense and, when appropriate, established theory.
Most often, the relationship between the dependent and independent variables is assumed to be linear (a straight line). In our example, this means that revenue is assumed to be a linear function of the independent variables. It is important to understand that the five independent variables do not account for everything that causes revenue to rise or fall. To account for all the other possible factors that could affect revenue, we include an error term in the regression equation. It is often symbolized by the Greek letter epsilon $(\varepsilon)$. This is the theoretical regression model for the model in Fig. 10.1:

$$\text{Revenue}=\beta_{0}+\beta_{1}\,\text{Occupancy}+\beta_{2}\,\text{Price}+\beta_{3}\,\text{Advertising}+\beta_{4}\,\text{Competition}+\beta_{5}\,\text{Service}+\varepsilon$$
Note that we use plus signs (+) when expressing the general regression equation, even though we expect price and competition to have a negative relationship with revenue. When the beta coefficients are estimated, negative estimates will produce the proper relationship with the dependent variable (a plus sign and a negative coefficient yield a negative relationship).

$\beta_{0}$ (beta null) is called the constant or sometimes the $Y$-intercept. It indicates where the regression line intercepts the $Y$-axis in a two-dimensional plane. The constant $\beta_{0}$ is the value the dependent variable $Y$ (revenue) would take if all the independent variables (occupancy, price, advertising, competition, and service) were equal to zero. Often, the constant is pragmatically meaningless, only making sense from a mathematical perspective. Imagine using height to predict a person's body weight. You collect data from 50 respondents, measuring their height and weight. In a regression, when height is zero, the predicted weight will almost certainly be a nonzero number on the $Y$-axis. In reality, if height could be zero (which is impossible), weight could not be anything but zero. Keep in mind that you must include the constant in a regression equation, even when it has no practical meaning.
The betas ($\beta_{i}$, where $i=1,2,3,4,5$) in front of each independent variable are called the beta coefficients, and $\varepsilon$ is the error term. The right side of the equation (revenue is on the left side) can be divided into two parts: the error term represents the unexplained part, and the independent variables represent the explained part. Note that even though we call it the error term, it does not necessarily refer to errors. Sometimes it is called the disturbance term or the residual term. It must be included in the equation to represent the variance that is left over (i.e., not explained by the independent variables). The ambition with regression is to make the unexplained part $\varepsilon$ as small as possible. In other words, we want a regression equation with as much explanatory power as possible. $R^{2}$ is a measure of the explained variance in regression.
The regression coefficients $\beta_{i}$ $(i=1,2,3,4,5)$ indicate the "isolated" effect that each independent variable has on the dependent variable, revenue. For example, all else equal, $\beta_{3}$ tells how much effect a one-unit change in advertising will have on revenue. All else equal means that we assume the other independent variables (occupancy, price, competition, and service) are held constant. If, for example, the beta coefficient $\beta_{3}$ turns out to be 2.09, and advertising and revenue are both measured in millions of kronor (the unit of measurement), then increasing advertising by one million kronor will result in an expected revenue increase of 2,090,000 kronor, provided that the other variables are not changed. This is the predicted change in revenue.
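To make the calculation explicit, here is a minimal sketch in Python. The coefficient value 2.09 is the illustrative figure from the text, not an estimate from actual data:

```python
# Illustrative calculation only: 2.09 is the hypothetical coefficient from the text,
# with advertising and revenue both measured in millions of kronor.
beta_3 = 2.09            # estimated effect of advertising on revenue
delta_advertising = 1.0  # increase advertising by one million kronor

# All else equal, the predicted change in revenue (in millions of kronor)
delta_revenue = beta_3 * delta_advertising
print(f"Predicted change in revenue: {delta_revenue * 1_000_000:,.0f} kronor")
# Predicted change in revenue: 2,090,000 kronor
```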
Regression is common across many subjects and disciplines: finance, medicine, psychology, agriculture, and even sports. Movie tip: based on a true story, watch the movie Moneyball with Brad Pitt and Jonah Hill. Billy Beane (Pitt) hires Paul DePodesta (Hill) to predict which over-valued players to trade away and which under-valued players to acquire. Based on solid data analysis, including regression, they built an inexpensive baseball team that won a record 20 straight games in the 2002 American League season. Arguably, they changed how baseball is managed through data analysis.
10.2 Simple Regression Analysis
Imagine a stochastic (random) process that includes a dependent variable $Y$ and several independent variables $X_{1}, X_{2}, X_{3}, \ldots, X_{k}$. We could write $Y=F(X_{1}, X_{2}, X_{3}, \ldots, X_{k})+\varepsilon$, which means that $Y$ is a function of the independent variables $(X_{i})$ plus the random term.
Fig. 10.2 Simple regression line

Fig. 10.3 Simple regression model
Given the extreme unlikelihood of knowing all the independent variables in a function, and the unlikelihood of having a perfect set of population data, the true model is most often unknown. Instead, we try to identify the most important independent variables and measure them (plus the dependent variable) in a sample. In its simplest form, simple regression analysis estimates the effect of one independent variable on one dependent variable. The theoretical model may be expressed as follows:

$$Y=\beta_{0}+\beta_{1} X_{1}+\varepsilon$$
where $Y$ is the dependent variable and $X$ is the independent variable. $\beta_{0}$ is the constant (also the $Y$-axis intercept) and $\beta_{1}$ is the slope of the regression line. $\varepsilon$ represents the error term. The explained part, $\beta_{0}+\beta_{1} X_{1}$, can be represented graphically as a straight line in a coordinate system where the constant $\beta_{0}$ is the $Y$-intercept of the line and $\beta_{1}$ is its slope (see Fig. 10.2). The beta parameters $\beta_{0}$ and $\beta_{1}$ are what we want to estimate with the regression equation. Given that we are not including $X_{2}, X_{3}, X_{4}, \ldots, X_{k}$ in the equation, their missing contribution is captured by the error term $\varepsilon$. Other variation in the error term can come from measurement error, an incorrect functional form (e.g., if the regression line should actually be curvilinear), or pure coincidence.
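To make the model concrete, the following short simulation sketch in Python generates data from a simple regression model. The parameter values and sample size are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical "true" parameters of the simple regression model
beta_0, beta_1 = 5.0, 2.0

n = 100
X = rng.uniform(0, 10, size=n)    # one independent variable
eps = rng.normal(0, 1.5, size=n)  # the error (disturbance) term
Y = beta_0 + beta_1 * X + eps     # Y = beta_0 + beta_1 * X_1 + epsilon

# Each observation deviates from the straight line beta_0 + beta_1 * X by its
# error term; regression tries to recover beta_0 and beta_1 from (X, Y) alone.
```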
10.3 Estimating Regression Parameters
We use the Hotel data to show an example of simple regression (see Fig. 10.3).
The hypothesis could be written $H_{1}$: Occupancy has a positive effect on revenue.
In other words, we believe that the more rooms occupied in the hotel, the higher the hotel's revenue, measured in kronor. In general, when writing hypotheses in research reports, the alternative hypothesis is presented and the null hypothesis is implicit.

Fig. 10.4 Simple regression line with residuals and outlier
With the Hotel data, imagine that managers have been asked to provide occupancy data and revenue data. Occupancy and revenue are unlikely to be perfectly related. That is, a specific occupancy does not lead to exactly the same revenue in every hotel. Figure 10.4 shows how the data (the dots) may be represented in a two-dimensional plane. The data in this example are randomly spread in a somewhat linear fashion with a positive slope. The relationship between the two variables can be represented by the regression line that intersects them. The goal of the regression is to estimate the most efficient line through the data. We are demonstrating ordinary least squares (OLS) regression, which is one of the most common estimation methods. The residuals are the distances from the data points to the regression line. They represent the error (unexplained variance) in the equation. If the data are more spread out, the residuals get bigger, and the regression has more unexplained variance. We have also included one data point called an outlier. This would be a hotel that, despite very low occupancy, has very high revenue. Outliers often have an inordinately large influence on the parameter estimates, so they must be evaluated and possibly removed from the data. Perhaps the outlier is a luxury hotel catering to a few very wealthy clients. You would have to ask yourself whether it is appropriate to keep it in the dataset. If, for example, you are trying to draw general conclusions about typical hotels, it may be best to remove the luxury hotels from the sample.
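As a rough sketch of what OLS estimation does with data like those in Fig. 10.4, the code below applies the standard closed-form formulas for simple regression. The occupancy and revenue numbers are invented for illustration and are not the Hotel data:

```python
import numpy as np

# Hypothetical sample: occupancy rate (%) and revenue (millions of kronor)
occupancy = np.array([55, 60, 64, 70, 73, 78, 82, 85, 90, 95], dtype=float)
revenue = np.array([3.1, 3.4, 3.9, 4.2, 4.6, 4.9, 5.4, 5.6, 6.1, 6.5])

# Closed-form OLS estimates for simple regression: slope and intercept
b1 = np.cov(occupancy, revenue, ddof=1)[0, 1] / np.var(occupancy, ddof=1)
b0 = revenue.mean() - b1 * occupancy.mean()

fitted = b0 + b1 * occupancy
residuals = revenue - fitted  # vertical distances from each point to the line

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"Residual sum of squares: {np.sum(residuals ** 2):.4f}")

# An outlier (e.g., very low occupancy but very high revenue) can pull the line
# toward itself; refitting with and without that point reveals its influence.
```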
To run the regression analysis in SPSS with the Hotel data, choose Analyze > Regression > Linear. Put Revenue in the Dependent box and Occupancy in the Independent(s) box, then click OK (see Fig. 10.5).
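For readers working outside SPSS, the same model can be fitted in Python with statsmodels. This sketch assumes the Hotel data have been exported to a CSV file; the file name and the column names Revenue and Occupancy are assumptions and should be adjusted to match the actual file:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumes the Hotel data were exported to a CSV file with columns named
# "Revenue" and "Occupancy" (file name and column names are assumptions).
hotel = pd.read_csv("hotel.csv")

# Simple regression of Revenue on Occupancy; the constant is included by default
model = smf.ols("Revenue ~ Occupancy", data=hotel).fit()

# Roughly mirrors the SPSS output: coefficients, t-tests, R-squared, ANOVA F-test
print(model.summary())
```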
The Model Summary table (see Table 10.1) shows $R^{2}$ and $R_{adj}^{2}$. They are sometimes referred to as the coefficient of determination. They represent two measures of the explained variance in the regression equation. $R_{adj}^{2}$ is based on $R^{2}$, with the additional element of taking into account the number of independent variables. Parsimony is a virtue in regression models. In general, you should aim to find the simplest possible regression model that explains the most variance in the dependent variable. From this perspective, adding more independent variables to explain slightly more variance is counterproductive. $R_{adj}^{2}$ includes a penalty for each additional independent variable.
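Using the notation introduced at the start of the chapter, the two measures can be written as follows; the adjusted form shown here is the standard one, with $n$ observations and $k$ independent variables:

$$R^{2}=\frac{ESS}{TSS}=1-\frac{RSS}{TSS}, \qquad \bar{R}^{2}=1-\left(1-R^{2}\right) \frac{n-1}{n-k-1}$$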
Fig. 10.5 Regression dialogue box
Table 10.1 Model summary: $R^{2}$ and $R_{adj}^{2}$