10.1 Introduction
10.2 Simple Regression Analysis
10.3 Estimating Regression Parameters
10.4 The $t$-test
10.5 Multiple Regression Analysis
10.6 Explained Variance
10.7 The ANOVA Table and $F$-test
10.8 Too Many or Too Few Independent Variables
10.9 Regression with Dummy Variables
10.10 Dummy Regression: An Alternative Analysis of Covariance (ANCOVA)
10.11 The Classic Assumptions of Multiple Regression
10.12 Summary
References
10.1 Introduction
Imagine that someone has a lot of data about you: where you work or go to school, what you like and dislike, who your friends are, and so on. Based on this data, they can predict your behaviors and choices with great accuracy: what car you will purchase, where you will go on holiday, what you will buy your mother for her birthday. They can do this because most human behaviors are rational insofar as they follow consistent patterns; we are creatures of habit. And they can predict your behavior because they know how to use regression analysis.
In this chapter, we explain regression analysis. It is the most classic and widely used statistical method for analyzing the dependency relationship between two or more variables. Regression analysis estimates the effect of one or more independent variables $(X)$ on a single dependent variable $(Y)$. We use the following notation:

| Symbol | Meaning |
| :--- | :--- |
| $X$ | Variable $X$ (independent variable) |
| $Y$ | Variable $Y$ (dependent variable) |
| $\beta_{1}$ | Regression parameter |
| $\beta_{0}$ | The constant (regression intercept) |
| $\widehat{Y}$ | The predicted value of $Y$ |
| $\widehat{\beta}$ | The estimate of $\beta$ |
| $\varepsilon_{i}$ | The error term (also called the disturbance term) |
| $e_{i}$ | The estimate of the error term (alternative notation: $\widehat{\varepsilon}_{i}$) |
| $R^{2}$ | Explained variance of a regression equation (coefficient of determination) |
| $\bar{R}^{2}$ | Adjusted R-square. Also expressed as $R_{adj}^{2}$ |
| $TSS$ | Total sum of squares (total variance) |
| $RSS$ | Residual sum of squares (residual variance) |
| $ESS$ | Explained sum of squares (explained variance) |
| $F$ | Test statistic for the $F$-distribution |
| $H_{0}$ | Null hypothesis |
| $H_{i}$ | Alternative hypothesis, where $i$ is any whole number $(1, 2, 3, \ldots)$ |
| $n$ | Sample size |
| $OLS$ | Ordinary least squares estimation |
| $\|t\|$ | Absolute value of the test statistic for the $t$-distribution |
| $Corr(X, Y)$ | Correlation between variable $X$ and variable $Y$ |
| $k$ | Number of independent variables $\left(X_{1}, X_{2}, \ldots, X_{k}\right)$ |
| $i$ | The $i$th observation |

Regression analysis is one of many statistical methods used to evaluate the relationship between one or more so-called independent variables $X_{1}, X_{2}, \ldots, X_{k}$ and a dependent variable $Y$. Specifically, regression analysis assesses how changes in the independent variables explain changes in the dependent variable. For example, regression can be used to study:
Patient satisfaction $(Y)$, and the independent variables physician communication $(X_{1})$ and service level during hospital stay $(X_{2})$.
Demand for a product $(Y)$, and the independent variables price $(X_{1})$, advertising $(X_{2})$, and market share $(X_{3})$.
Change in blood pressure $(Y)$, and hypertension medication $(X_{1})$.
Crop yield $(Y)$, and rainfall $(X_{1})$, air temperature $(X_{2})$, and soil nitrogen content $(X_{3})$.
While these examples imply a cause-effect relationship from the independent $(X)$ variables to the dependent $(Y)$ variable, regression itself does not prove causality. It simply tests whether the modeled correlations are significantly different from zero.
Fig. 10.1 Research model for factors affecting hotel revenue
For example, we believe that occupancy rates, price, advertising, competition, and service level determine the amount of revenue at a hotel (see Fig. 10.1). We cannot prove this, but with regression, we can test for significant relationships between the five independent variables and the dependent variable, revenue. If we find significant relationships, we can predict revenue for different values of each independent variable. That is, we can estimate how much a change in occupancy, price, advertising, competition, or service would change revenue. Causality itself is based on robust argumentation and theory, and is supported (or refuted) by evidence.
We expect that occupancy, advertising, and service will have a positive effect on revenue, while price and competition are expected to have a negative effect. This means that if the occupancy rate goes up, advertising is increased, or service is improved, then revenue will increase, whereas if the price goes up or the number of competitors increases, revenue will decrease. The choice of independent variables should be logical and, if possible, theoretically justified. By this we mean that it should be possible to argue for each variable based on common sense and, when appropriate, established theory.
Most often, the relationship between the dependent and independent variables is assumed to be linear (a straight line). In our example, this means that revenue is assumed to be a linear function of the independent variables. It is important to understand that the five independent variables do not account for everything that will cause revenue to rise or fall. To account for all the other possible things that could affect revenue, we include an error term in the regression equation. It is often symbolized by the Greek letter epsilon $(\varepsilon)$. This is the theoretical regression model for the model in Fig. 10.1:
$$\text{Revenue} = \beta_{0} + \beta_{1}\,\text{Occupancy} + \beta_{2}\,\text{Price} + \beta_{3}\,\text{Advertising} + \beta_{4}\,\text{Competition} + \beta_{5}\,\text{Service} + \varepsilon$$
Note that we use plus signs (+) when expressing the general regression equation, even though we expect price and competition to have a negative relationship with revenue. When the beta coefficients are estimated, any that are negative will still have the proper relationship with the dependent variable (adding a negative term is the same as subtracting it).
$\beta_{0}$ (beta zero) is called the constant or sometimes the $Y$-intercept. It indicates where the regression line intercepts the $Y$-axis in a two-dimensional plane. The constant $\beta_{0}$ is the value that the dependent variable $Y$ (revenue) would have if all the independent variables (occupancy, price, advertising, competition, and service) were equal to zero. Often, the constant is pragmatically meaningless, only making sense from a mathematical perspective. Imagine using height to predict a person's body weight. You collect data from 50 respondents, measuring their height and weight. In a regression, when height is zero, the predicted weight will almost certainly be a nonzero number on the $Y$-axis. In reality, if height could be zero (which is impossible), weight could not be anything but zero. Keep in mind that you must include the constant in a regression equation, even when it does not have practical meaning.
The betas ($\beta_{i}$, where $i = 1, 2, 3, 4, 5$) in front of each independent variable are called the beta coefficients, and $\varepsilon$ is the error term. The right side of the equation (revenue is on the left side) can be divided into two parts. The independent variables represent the explained part, and the error term represents the unexplained part. Note that even though we call it the error term, it does not necessarily refer to errors. Sometimes, it is called the disturbance term or the residual term. It must be included in the equation to represent the variance that is left over (i.e., not explained by the independent variables). The ambition with regression is to have the unexplained part $\varepsilon$ as small as possible. In other words, we want a regression equation with as much explanatory power as possible. $R^{2}$ is a measure of the explained variance in regression.
The regression coefficients $\beta_{i}$ $(i = 1, 2, 3, 4, 5)$ indicate the "isolated" effect that each independent variable has on the dependent variable, revenue. For example, all else equal, $\beta_{3}$ tells how much effect a one-unit change in advertising will have on revenue. All else equal means that we assume the other independent variables (occupancy, price, competition, and service) are kept constant. If, for example, the beta coefficient $\beta_{3}$ turns out to be 2.09, and advertising and revenue are both measured in millions of kronor, then increasing advertising by one million kronor will result in an expected revenue increase of 2,090,000 kronor, provided that the other variables are not changed. This is the predicted change in revenue.
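The "all else equal" reading can be illustrated with a minimal Python sketch that evaluates a linear revenue model twice, changing only advertising. Only the advertising coefficient (2.09) comes from the text; the other coefficients, the baseline values, and the names `COEFFICIENTS` and `predict_revenue` are hypothetical placeholders chosen purely for illustration.

```python
# Hypothetical coefficients for the hotel revenue model (millions of kronor).
# Only the advertising coefficient (2.09) comes from the text; all other
# numbers are illustrative placeholders, not estimates.
COEFFICIENTS = {
    "constant": 1.0,
    "occupancy": 0.5,
    "price": -0.3,
    "advertising": 2.09,
    "competition": -0.8,
    "service": 1.2,
}


def predict_revenue(values):
    """Return predicted revenue for a dict of independent-variable values."""
    total = COEFFICIENTS["constant"]
    for name, value in values.items():
        total += COEFFICIENTS[name] * value
    return total


# Hypothetical starting values for one hotel.
baseline = {"occupancy": 60.0, "price": 9.0, "advertising": 3.0,
            "competition": 4.0, "service": 5.0}

# Increase advertising by one unit (one million kronor), all else equal.
more_advertising = dict(baseline, advertising=baseline["advertising"] + 1.0)

change = predict_revenue(more_advertising) - predict_revenue(baseline)
print(f"Predicted change in revenue: {change:.2f} million kronor")
```

Because the model is linear, the predicted change equals the advertising coefficient whatever baseline values are chosen.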
Regression is common across many subjects and disciplines: finance, medicine, psychology, agriculture, and even sports. Movie tip: watch Moneyball, based on a true story, with Brad Pitt and Jonah Hill. Billy Beane (Pitt) hires Paul DePodesta (Hill) to predict which over-valued players to trade away and which under-valued players to acquire. Based on solid data analysis, including regression, they built an inexpensive baseball team that won a record 20 straight games in the American baseball league's 2002 season. Arguably, they changed how baseball is managed through data analysis.
10.2 Simple Regression Analysis
Fig. 10.2 Simple regression line
Fig. 10.3 Simple regression model
Imagine a stochastic (random) process that includes a dependent variable $Y$ and several independent variables $X_{1}, X_{2}, X_{3}, \ldots, X_{k}$. We could write $Y = F\left(X_{1}, X_{2}, X_{3}, \ldots, X_{k}\right) + \varepsilon$, which means that $Y$ is a function of the independent variables $(X_{i})$ plus the random term. Given the extreme unlikelihood of knowing all the independent variables in a function, and the unlikelihood of having a perfect set of population data, the true model is most often unknown. Instead, we try to identify the most important independent variables and measure them (plus the dependent variable) in a sample. In its simplest form, regression analysis estimates the relationship between one independent variable and one dependent variable. The theoretical model may be expressed as follows:
$$Y = \beta_{0} + \beta_{1} X_{1} + \varepsilon$$
Here, $Y$ is the dependent variable and $X$ is the independent variable. $\beta_{0}$ is the constant (also the $Y$-axis intercept), $\beta_{1}$ is the slope of the regression line, and $\varepsilon$ represents the error term. The explained part, $\beta_{0} + \beta_{1} X_{1}$, can be represented graphically as a straight line in a coordinate system, where the constant $\beta_{0}$ is the $Y$-intercept of the line and $\beta_{1}$ is its slope (see Fig. 10.2). The beta parameters $\beta_{0}$ and $\beta_{1}$ are what we want to estimate with the regression equation. Given that we are not including $X_{2}, X_{3}, X_{4}, \ldots, X_{k}$ in the equation, their missing contribution is captured by the error term $\varepsilon$. Other variation in the error term can come from measurement error, an incorrect functional form (e.g., the regression line should actually be curvilinear), or pure coincidence.
10.3 Estimating Regression Parameters
We use the Hotel data to show an example of simple regression (see Fig. 10.3).
The hypothesis could be written: $H_{1}$: Occupancy has a positive effect on revenue.
In other words, we believe that the more rooms occupied in the hotel, the higher the hotel's revenue, measured in kronor. In general, when writing hypotheses in research reports, the alternative hypothesis is presented and the null hypothesis is implicit.
Fig. 10.4 Simple regression line with residuals and outlier
With the Hotel data, imagine that managers have been asked to provide occupancy data and revenue data. Occupancy and revenue are unlikely to be perfectly related. That is, a specific occupancy does not lead to exactly the same revenue in every hotel. Figure 10.4 shows how the data (the dots) may be represented in a two-dimensional plane. The data in this example are randomly spread in a somewhat linear fashion with a positive slope. The relationship between the two variables can be represented by the regression line that passes through them. The goal of the regression is to estimate the best-fitting line through the data. We are demonstrating ordinary least squares (OLS) regression, which is one of the most common estimation methods. The residuals are the vertical distances from the data points to the regression line. They represent the error (unexplained variance) in the equation. If the data are more spread out, the residuals get bigger, and the regression has more unexplained variance. We have also included one data point called an outlier. This would be a hotel that, despite very low occupancy, has very high revenue. Outliers often have an inordinately large influence on the parameter estimates, so they must be evaluated and possibly removed from the data. Perhaps the outlier is a luxury hotel catering to a few very wealthy clients. You would have to ask yourself whether it is appropriate to keep it in the dataset. If, for example, you are trying to draw general conclusions about typical hotels, it may be best to remove the luxury hotels from the sample.
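The influence of a single outlier can be seen in a small numerical sketch. The Python code below fits a simple least squares line twice, once to five made-up hotels and once with an added low-occupancy, high-revenue hotel. All data values and the function name `ols_slope_intercept` are hypothetical, and the code uses the standard least squares formulas for simple regression.

```python
def ols_slope_intercept(xs, ys):
    """Estimate intercept and slope of a simple regression by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical hotels: occupancy (rooms) and revenue (thousand kronor).
occupancy = [10, 20, 30, 40, 50]
revenue = [12, 19, 32, 41, 50]

# Same data plus one hypothetical luxury hotel: low occupancy, high revenue.
occupancy_with_outlier = occupancy + [5]
revenue_with_outlier = revenue + [90]

_, slope_without = ols_slope_intercept(occupancy, revenue)
_, slope_with = ols_slope_intercept(occupancy_with_outlier, revenue_with_outlier)

print(f"Slope without outlier: {slope_without:.3f}")
print(f"Slope with outlier:    {slope_with:.3f}")
```

With these made-up numbers, one extreme observation is enough to turn a clearly positive slope into a negative one, which is why outliers deserve a deliberate keep-or-remove decision.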
To run the regression analysis in SPSS with the Hotel data, choose: Analyze >> Regression >> Linear. Put Revenue in the Dependent box and Occupancy in the Independent(s) box, then click OK (see Fig. 10.5).
The Model Summary table (see Table 10.1) shows $R^{2}$ and $R_{adj}^{2}$. They are sometimes referred to as the coefficient of determination. They represent two measures of the explained variance in the regression equation. $R_{adj}^{2}$ is based on $R^{2}$, with the additional element of taking into account the number of independent variables. Parsimony is a virtue in regression models. In general, you should aim to find the simplest possible regression model that explains the most variance in the dependent variable. From this perspective, adding more independent variables to explain only slightly more variance is counterproductive. $R_{adj}^{2}$ includes a penalty for each additional independent variable. If a variable's contribution to the explained variance is minimal, $R_{adj}^{2}$ will drop when the variable is added. The table shows a fairly low $R^{2}$ of 11.2%. This means that 88.8% of the variance is unexplained.
Fig. 10.5 Regression dialogue box
Table 10.1 Model summary: $R^{2}$ and $R_{adj}^{2}$

| Model summary | | | | |
| :--- | :--- | :--- | :--- | :--- |
| Model | R | R square | Adjusted R square | Std. error of the estimate |
| 1 | $0.335^{\mathrm{a}}$ | 0.112 | 0.103 | 19.212 |

Table 10.2 ANOVA: df, $F$, and Sig.

| ANOVA $^{\mathrm{a}}$ | | | | | | |
| :--- | :--- | :---: | :---: | :---: | :--- | :--- |
| Model | | Sum of squares | df | Mean square | F | Sig. |
| 1 | Regression | 4579.882 | 1 | 4579.882 | 12.408 | $0.001^{\mathrm{b}}$ |
| | Residual | 36,173.108 | 98 | 369.113 | | |
| | Total | 40,752.990 | 99 | | | |

$^{\mathrm{a}}$ Dependent variable: Revenue
$^{\mathrm{b}}$ Predictors: (Constant), Occupancy
Table 10.2 shows the ANOVA table output. The degrees of freedom (df) are used when referring to the $t$-tables and $F$-tables. $F$ shows the test statistic to compare against the $F$-tables. Sig. is the significance probability, which is also referred to as the $p$-value. Since $0.001 < 0.05$, the regression equation as a whole is statistically significant. It also means that $R^{2}$ is significantly different from zero. We discuss these in more detail later in the chapter.
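The figures in the Model Summary and ANOVA outputs are linked by simple arithmetic. The Python sketch below recomputes the mean squares, $F$, $R^{2}$, $R_{adj}^{2}$, and the standard error of the estimate from the sums of squares and degrees of freedom reported in Table 10.2. The adjusted R-square formula used here, $1 - (1 - R^{2})(n-1)/(n-k-1)$, is the standard one and is an assumption in the sense that this section does not state it; the variable names are illustrative.

```python
import math

# Values reported in the ANOVA output (Table 10.2).
ess = 4579.882     # explained (regression) sum of squares
rss = 36173.108    # residual sum of squares
n = 100            # sample size (total df = n - 1 = 99)
k = 1              # number of independent variables

tss = ess + rss                        # total sum of squares
df_regression = k
df_residual = n - k - 1

ms_regression = ess / df_regression    # mean square, regression
ms_residual = rss / df_residual        # mean square, residual

f_stat = ms_regression / ms_residual
r_squared = ess / tss
# Standard adjusted R-square formula (assumed; not stated in this section).
r_squared_adj = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
std_error_estimate = math.sqrt(ms_residual)

print(f"Mean square (residual) = {ms_residual:.3f}")
print(f"F                      = {f_stat:.3f}")
print(f"R square               = {r_squared:.3f}")
print(f"Adjusted R square      = {r_squared_adj:.3f}")
print(f"Std. error of estimate = {std_error_estimate:.3f}")
```

Recomputing the output this way is a quick consistency check: every derived quantity should match the SPSS tables to rounding.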
Table 10.3 shows the output for the regression coefficients. The constant $(\beta_{0})$ is $-7.726$, indicating where the regression line intercepts the $Y$-axis. It makes sense that revenue $(Y)$ is negative when occupancy reaches zero. Hotels have costs despite having no guests. The unstandardized beta coefficient for occupancy $(\beta_{1})$ is 0.49, meaning that the slope of the regression line is positive. Higher occupancy predicts higher revenue. The significance probability of 0.001 means that occupancy has a statistically significant effect on revenue.
The regression equation can be expressed as:
$$Y_{i} = -7.726 + 0.49 X_{i} + e_{i}$$
This is the estimated model for observation $i$, where $Y$ = Revenue and $X$ = Occupancy. A one-unit increase in occupancy (one more room rented) increases predicted revenue by 0.49 thousand kronor (revenue is measured in thousands of kronor).
We can create a plot of the regression line by choosing: Graphs >> Chart Builder, then (1) choose Scatter/Dot, (2) drag the simple scatterplot up to the open box, and (3) drag Revenue to the $Y$ box and Occupancy to the $X$ box. Click OK (see Fig. 10.6).
Double-click on the graph in the output and then click on the fit line at total icon (see Fig. 10.7). This adds the regression line to the graph (see Fig. 10.8). Finally, to get the proper perspective, click on the $X$-axis and set the minimum to zero, then click on the $Y$-axis and set the minimum to $-10$.
In Fig. 10.8, you can see the regression line for predicting values of revenue. The $Y$-intercept (constant $\beta_{0}$) is $-7.77$ and the slope $(\beta_{1})$ of the line is 0.49. Compared to the example in Fig. 10.4, the data are more randomly spread around the graph. We have marked one of the residuals to show that in some cases they are very large. Given the size of the residuals, it is not surprising that the explained variance $(R^{2})$ is only 11.2%.
Fig. 10.6 Regression scatterplot
Fig. 10.7 Fit line at total icon
The most common estimation method for regression is ordinary least squares (OLS) estimation. The idea is to find the estimates for which the sum of the squared residuals is as small as possible. The formulas for the beta estimates are:
$$\widehat{\beta} = \frac{\sum xy}{\sum x^{2}} \quad \text{and} \quad \widehat{\beta}_{0} = \bar{Y} - \widehat{\beta}\,\bar{X}$$
where
$$x = X - \bar{X} \quad \text{and} \quad y = Y - \bar{Y}$$
and $\bar{X}$ and $\bar{Y}$ are the respective mean values. Defined this way, $x$ and $y$ are often called the deviation scores. Visually, this means determining the slope of the regression line so that the sum of the squared residuals (estimated error) is minimal. The error term $(\varepsilon)$ can be estimated by the formula:
$$e_{i} = Y_{i} - \widehat{\beta}_{0} - \widehat{\beta} X_{i}$$
This difference is called the residual. $\widehat{\beta}_{0} + \widehat{\beta} X_{i}$ is usually symbolized by $\widehat{Y}_{i}$, such that $\widehat{Y}_{i} = \widehat{\beta}_{0} + \widehat{\beta} X_{i}$. In other words, the estimate of the error term, $e_{i}$, is equal to $Y_{i}$ minus the predicted value $\widehat{Y}_{i}$, that is, $e_{i} = Y_{i} - \widehat{Y}_{i}$. Note that we write $e_{i}$ specifically to indicate that it is the predicted or estimated error value. An alternative notation would be $\widehat{\varepsilon}_{i}$.
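The deviation-score formulas translate almost line by line into code. The following Python sketch computes the slope, the constant, and the residuals for a tiny made-up dataset; the data and the function name `fit_simple_ols` are hypothetical and serve only to show the mechanics.

```python
def fit_simple_ols(xs, ys):
    """Return (b0, b1, residuals) for the simple regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Deviation scores: x = X - mean(X) and y = Y - mean(Y).
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    # Slope: sum(xy) / sum(x^2); constant: mean(Y) - slope * mean(X).
    b1 = sum(dx * dy for dx, dy in zip(dev_x, dev_y)) / sum(dx * dx for dx in dev_x)
    b0 = mean_y - b1 * mean_x
    # Residual: observed Y minus predicted Y.
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return b0, b1, residuals


# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1, residuals = fit_simple_ols(xs, ys)
rss = sum(e * e for e in residuals)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, RSS = {rss:.2f}")
```

A useful property to check is that the OLS residuals sum to zero (up to rounding) whenever the constant is included in the model.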
Example 10.1
Assume that the relationship between the price of apples $(X)$ and per kilo sales $(Y)$ is linear. This gives the theoretical regression model:
$$Y_{i} = \beta_{0} + \beta_{1} X_{i} + \varepsilon_{i}$$
The data in Table 10.4 were collected over a period of 10 days. Each day shows how many kilos of apples were sold and at what price (in kronor). The SPSS file is called Apple sales.
Using OLS estimation, the estimated model for observation $i$ is:
$$Y_{i} = 50.3 - 1.55 X_{i} + e_{i}$$
The predicted value $\widehat{Y}_{i}$ (sales) for observation $i$ is computed by the following equation:
$$\widehat{Y}_{i} = 50.3 - 1.55 X_{i}$$
Fig. 10.9 Regression plot for apple sales
Note that we place a hat (^) over the parameter or variable that we predict. Assume that the price of apples is 12 kronor today. An increase of 2 kronor will change predicted sales by $-1.55 \times 2 = -3.1$ kilos, that is, a reduction of 3.1 kilos. Figure 10.9 shows the graph of the regression line. With a negative beta coefficient for the $X$ variable, the slope is negative.
The constant ($Y$-intercept) suggests that when the price is zero, sales will be 50.3 kilos. This is an excellent example of what we mentioned earlier about the absurd values that the intercept can have. Our "theory of apple pricing" probably holds above a certain minimum price and below some maximum price. The same is true of many theories: they apply within a range of values. However, sometimes the constant is important from a theoretical perspective, so do not automatically ignore it, and it must always be included in OLS regression.
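The apple example can be checked with a few lines of Python. The sketch below uses the constant (50.3) and slope (-1.55) quoted in the text; the function name `predicted_sales` and the idea of computing the price at which predicted sales reach zero are illustrative additions, meant only to show why the fitted line should not be trusted far outside the observed price range.

```python
B0 = 50.3    # estimated constant: predicted kilos sold at a price of zero
B1 = -1.55   # estimated slope: change in kilos sold per krona of price


def predicted_sales(price):
    """Predicted kilos of apples sold at a given price in kronor."""
    return B0 + B1 * price


sales_at_12 = predicted_sales(12)
sales_at_14 = predicted_sales(14)
change = sales_at_14 - sales_at_12

# Price at which the fitted line predicts zero sales; beyond this point
# the line would predict negative sales, which is not meaningful.
zero_sales_price = -B0 / B1

print(f"Predicted sales at 12 kronor: {sales_at_12:.1f} kilos")
print(f"Predicted sales at 14 kronor: {sales_at_14:.1f} kilos")
print(f"Change for a 2 kronor increase: {change:.1f} kilos")
print(f"Predicted sales reach zero at about {zero_sales_price:.2f} kronor")
```

Both ends of the line, a price of zero and a price above roughly 32 kronor, give predictions that make little practical sense, which mirrors the point that the model applies only within a range of values.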
Example 10.2
In this example, the constant is very important. Assume that we estimate a regression model for the gender pay gap $(Y)$ and seniority $(X)$, for men and women employed at a consultancy. The regression equation for the predicted value $\widehat{Y}_{i}$ for observation $i$ is:
$Y$ is measured, for each year of seniority, as the sum of pay for all women minus the sum of pay for all men. In this example, the constant means that in year zero, when a person is hired, women get, on average, 10 thousand kronor per year more than men. Both groups increase their earnings by 5500 kronor for each year of employment.
10.4 The $t$-test
Imagine that management at a hotel decides to increase service in an effort to increase revenue. They implement a large (expensive) project that includes substantial training of all employees. After implementation, the managers see only small changes in revenue. This shows that strategic choices can have serious financial consequences. To better understand the implementation and its effect on revenue, the managers could run a regression analysis. With the variables revenue ($Y$) and implementation costs ($X$), they can test whether there is a statistically significant relationship between the implementation and revenue. Specifically, they test whether the regression slope coefficient $\beta_1$ is significantly different from zero. In OLS regression, they look at the $t$-test of the regression coefficient(s) (we discuss the $F$-test and multiple regression later in the chapter). This would show whether the implementation (measured as its cost) affects revenue. The null hypothesis for the slope coefficient is $\mathrm{H}_0: \beta_1 = 0$, which can also be expressed as:
$$Y_i = \beta_0 + 0 \cdot X_i + \varepsilon_i = \beta_0 + \varepsilon_i$$
In other words, the null hypothesis is that $Y$ (revenue) is equal to the constant plus random error. This is the same as saying that $X$ has no effect on $Y$: the investment in service has no effect on revenue.
The alternative is that there is a significant relationship between $X$ and $Y$. There are three possibilities: one two-sided and two one-sided. The two-sided hypothesis would be that revenue may go up or down as a result of an increase in service. Either the cost of the improvement substantially exceeds the revenue it generates, leading to a drop in revenue, or the cost is substantially outweighed by the resulting revenue increase. This means that from a two-sided perspective, $\beta_1$ is significantly different from zero ($\neq$), either higher or lower. The two one-sided alternatives are to hypothesize that $\beta_1$ is significantly less than zero, or that $\beta_1$ is significantly greater than zero.
The procedure for all three options is the same, except that in the $t$-tables you refer to the one-sided or two-sided critical values for a given level of significance ($\alpha$). $t$-tables are normally constructed with a range of degrees of freedom down the left side and the one-sided and two-sided alternatives for each level of significance ($\alpha$) across the top. For example, for large samples (over about 100), the critical cutoff value for a two-sided $t$-test at a significance level of 5% is 1.96. The degrees of freedom for simple regression are $n-2$. For multiple regression (discussed later), the formula is $n-k-1$, where $k$ is the number of independent variables. The test-statistic for $t$ is calculated by dividing the estimated beta coefficient ($\widehat{\beta}$) by the standard error of the beta coefficient ($s_{\widehat{\beta}}$):
$$t = \frac{\widehat{\beta}}{s_{\widehat{\beta}}}$$
Fig. 10.10 $t$-distribution: two-sided and one-sided tests
The Two-Sided $t$-test: Positive or Negative
$\mathrm{H}_0: \beta_1 = 0$, and $\mathrm{H}_1: \beta_1 \neq 0$. $\mathrm{H}_0$ is rejected if $|t| > t_\alpha$, using the two-sided critical value. We then say that $\beta_1$ is not equal to zero. The sign of $\widehat{\beta}_1$ is irrelevant because it can be either positive or negative.
The One-Sided $t$-test: Positive
$\mathrm{H}_0: \beta_1 \leq 0$, and $\mathrm{H}_1: \beta_1 > 0$. $\mathrm{H}_0$ is rejected if $|t| > t_\alpha$ for a one-sided test and the sign of the estimate $\widehat{\beta}_1$ is positive. Look at the $t$-table, but this time look at the one-sided alternative for a given level of significance ($\alpha$). For a large sample (over 100), the critical cutoff value for a one-sided $t$-test at a significance level of 5% is 1.645.
The One-Sided $t$-test: Negative
$\mathrm{H}_0: \beta_1 \geq 0$, and $\mathrm{H}_1: \beta_1 < 0$. $\mathrm{H}_0$ is rejected if $|t| > t_\alpha$ for a one-sided test and the sign of the estimate $\widehat{\beta}_1$ is negative. Exactly as for the one-sided (positive) $t$-test above, look at the $t$-table for the one-sided alternative for a given level of significance ($\alpha$). The $t$-table does not distinguish between positive and negative, so for a large sample (over 100), the critical cutoff value for a one-sided $t$-test at a significance level of 5% is again 1.645.
Visually, the $t$-distribution can be expressed as in Fig. 10.10. At each end (tail) of the distribution, there is an area designated as alpha ($\alpha$), which is the significance level of the test. Two-sided tests use both tails, whereas one-sided tests use one tail. You decide the level of significance and whether it is a one-sided or two-sided test. Then, based on the degrees of freedom ($\nu$), you find the critical cutoff value. The shape of the distribution curve changes slightly with the degrees of freedom.
Example 10.3
Picking up the apple sales example again, we assume that when prices rise, sales drop. The hypotheses are: $\mathrm{H}_0: \beta_1 \geq 0$, and $\mathrm{H}_1: \beta_1 < 0$
We use a 5% significance level. Our OLS regression estimates gave these values: $\widehat{\beta}_0 = 50.28$ and $\widehat{\beta}_1 = -1.55$
We can calculate the test-statistic for $t$ using the formula:
$$t = \frac{\widehat{\beta}_1}{s_{\widehat{\beta}_1}} = -4.579$$
This is confirmed by looking at the coefficients output generated in SPSS:
If we want to compare the test-statistic ($-4.579$) against the critical cutoff value in the $t$-tables, we take $n-2 = 10-2 = 8$ degrees of freedom. This is a one-sided test because we hypothesized a negative relationship, and we chose a 5% significance level. The cutoff value from the tables is 1.860. Since $|-4.579| > 1.860$ and the estimate is negative, we reject the null hypothesis. We conclude that price has a negative relationship with sales of apples. We can also draw this conclusion by simply looking at the p-value (significance probability). Since $0.002 < 0.05$, we reject the null hypothesis.
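The decision rules for the three alternatives can be written compactly in Python. The sketch below is illustrative only: the function `reject_h0` and its argument names are our own, and the critical value is assumed to be read from a $t$-table by hand, as in the example.

```python
def reject_h0(t_stat: float, critical: float, alternative: str) -> bool:
    """Decision rule for a t-test on a regression slope.

    alternative is "two-sided" (H1: beta != 0), "greater" (H1: beta > 0)
    or "less" (H1: beta < 0). `critical` is the positive cutoff read from
    the t-table for the chosen significance level and degrees of freedom.
    """
    if alternative == "two-sided":
        return abs(t_stat) > critical
    if alternative == "greater":
        return t_stat > critical
    if alternative == "less":
        return t_stat < -critical
    raise ValueError(f"unknown alternative: {alternative}")


# Apple sales example: n = 10 days, simple regression.
n = 10
df = n - 2                   # degrees of freedom for simple regression
t_stat = -4.579              # test-statistic reported in the example
critical_one_sided = 1.860   # one-sided 5% cutoff for 8 degrees of freedom

decision = reject_h0(t_stat, critical_one_sided, "less")
print(f"df = {df}, reject H0: {decision}")
```

Writing the one-sided rules as `t_stat > critical` and `t_stat < -critical` combines the two conditions in the text (absolute value above the cutoff, and the expected sign) into a single comparison.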
10.5 Multiple Regression Analysis
We can extend simple regression, where there is one independent variable, to multiple regression, where there are two or more independent variables. We began the chapter by suggesting our theory that revenue is a function of occupancy, price, advertising, competition, and service. We can formalize the hypotheses as:
H1: Occupancy has a positive effect on revenue.
H2: Price has a negative effect on revenue.
H3: Advertising has a positive effect on revenue.
H4: Competition has a negative effect on revenue.
H5: Service has a positive effect on revenue.
The model is:
We assume that the relationships are linear, which can be expressed in the following theoretical regression equation:
$$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon_i$$
Fig. 10.11 Regression model for hotel revenue
Here $Y$ is the dependent variable and the $X$s are the $k$ independent variables. $\varepsilon_i$ is the error term and the subscript $i$ denotes the $i^{\text{th}}$ observation. $\beta_0$ is the constant ($Y$-intercept) and $\beta_1, \beta_2, \ldots, \beta_k$ are the regression parameters.
Multiple regression is estimated in the same way as simple regression, with ordinary least squares (OLS) estimation. To run the multiple regression analysis in SPSS, choose: Analyze > Regression > Linear. Put Revenue into the Dependent box and Occupancy, Price, Advertising, Competition, and Service into the Independent(s) box (see Fig. 10.11).
10.6 Explained Variance
The first table in the SPSS regression output is the Model Summary (see Table 10.5). With $R^2$, you can answer the question: how much of the variance in $Y$ does the regression equation explain?
Total variance can be broken down into two main parts:
Total variance = explained variance + unexplained variance.
Explained variance is calculated from the sums of squares:
$$TSS\ (\text{total sum of squares}) = RSS\ (\text{regression sum of squares}) + ESS\ (\text{error sum of squares})$$
The fraction $\frac{RSS}{TSS}$ is called the explained variance. An alternative name is the coefficient of determination, which is abbreviated as $R^2$. Therefore, we can write:
$$R^2 = \frac{RSS}{TSS}, \text{ or } R^2 = 1 - \frac{ESS}{TSS}$$
Table 10.5 Sales and price coefficients
$R^2$ (expressed as r-square) has a value between 0 and 1, with values closer to 1 indicating higher explained variance. For example, when $R^2$ is above 0.5, more than half of the variance in $Y$ is explained by the regression equation and the rest is error. When $R^2$ is low, there are probably other important explanatory independent variables missing from the equation. In this situation, we could search for other independent variables that increase the explanatory power.
Early in this chapter we ran a simple regression between occupancy and revenue. The $R^2$ was only 11.2%. That means that occupancy only explains 11.2% of the variance in revenue. In this situation, it would be good to consider adding independent variables. In our new model (Fig. 10.12), we include four more independent variables (price, advertising, competition, and service). The Model Summary output in Table 10.6 shows that the explained variance ($R^2$) has increased to 42.5%. While this is a vast improvement, there are probably other variables we could add to substantially improve the model. Of course, we would have to have them in our dataset. $R^2$ is derived through the following equation:
$$R^2 = \frac{RSS}{TSS} = \frac{17{,}339.917}{40{,}752.990} = 0.425 = 42.5\%$$
The RSS and TSS values come from the ANOVA table (Table 10.7, below). They are: $RSS = 17{,}339.917$, $ESS = 23{,}413.073$, $TSS = 40{,}752.990$
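As a quick check of the decomposition, the two equivalent formulas for $R^2$ can be evaluated from the sums of squares reported for the hotel model. This is a small sketch using only those three numbers; the variable names are ours.

```python
# Sums of squares reported in the hotel ANOVA table.
RSS = 17_339.917   # regression (explained) sum of squares
ESS = 23_413.073   # error (residual) sum of squares
TSS = 40_752.990   # total sum of squares

# The decomposition TSS = RSS + ESS should hold up to rounding.
decomposition_gap = abs((RSS + ESS) - TSS)

# Two equivalent ways of computing the coefficient of determination.
r2_from_rss = RSS / TSS
r2_from_ess = 1 - ESS / TSS

print(f"TSS - (RSS + ESS) = {decomposition_gap:.6f}")
print(f"R^2 = RSS/TSS     = {r2_from_rss:.3f}")
print(f"R^2 = 1 - ESS/TSS = {r2_from_ess:.3f}")
```

Both formulas give the same value because the error share and the explained share of the total sum of squares add up to one.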
An important characteristic of $R^2$ is that it never decreases when independent variables are added, and in practice it almost always increases. The more independent variables you include, the higher $R^2$ will be, even when the additional variables are irrelevant. This may result in an overly optimistic evaluation of a regression equation when $R^2$ is used uncritically as a quality criterion.
Table 10.6 Model summary for multiple regression

| Model | R | R square | Adjusted R square | Std. error of the estimate |
| :--- | :--- | :--- | :--- | :--- |
| 1 | $0.652^{\mathrm{a}}$ | 0.425 | 0.395 | 15.782 |

$^{\mathrm{a}}$ Predictors: (Constant), Service, Occupancy, Competition, Advertising, Price

When comparing two regression models with the same dependent variable, but with a different number of explanatory variables, you should not automatically select the model with the highest $R^2$. Always consider the parsimony criterion, which means that simplicity in a regression model (or any model) is a virtue. Increasing the number of independent variables to gain tiny increases in explained variance may not be worth the trouble of measuring and including the variables. Fewer explanatory variables often lead to a more general model, which may be preferred. Since $R^2$ does not capture the element of parsimony, there is the adjusted $R$-square (expressed as $\bar{R}^2$ or $R_{adj}^2$). As we mentioned earlier, it takes into account the degrees of freedom in the model ($n-k-1$), thus adjusting for additional independent variables. The equation is:
$$R_{adj}^2 = 1 - \frac{\frac{ESS}{(n-k-1)}}{\frac{TSS}{(n-1)}} = 1 - \frac{\frac{23{,}413.073}{(100-5-1)}}{\frac{40{,}752.990}{(100-1)}} = 0.395 = 39.5\%$$
When the sample size is very large, the difference between $R^2$ and $R_{adj}^2$ is very small, so it is not so important to consider $R_{adj}^2$. On the other hand, when the sample size is small, $R_{adj}^2$ is very important to consider.
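The adjustment is easy to reproduce by hand. The sketch below assumes the hotel values ($n = 100$, $k = 5$) and adds a purely hypothetical second case with $n = 10{,}000$ to illustrate why the adjustment matters less in large samples; the function name is ours.

```python
def adjusted_r2(ess: float, tss: float, n: int, k: int) -> float:
    """Adjusted R-square from the error and total sums of squares."""
    return 1 - (ess / (n - k - 1)) / (tss / (n - 1))


ESS = 23_413.073
TSS = 40_752.990
r2 = 1 - ESS / TSS

# Hotel example: 100 observations, 5 independent variables.
adj_hotel = adjusted_r2(ESS, TSS, n=100, k=5)

# Hypothetical large sample with the same R^2: the penalty nearly vanishes.
adj_large = adjusted_r2(ESS, TSS, n=10_000, k=5)

print(f"R^2                    = {r2:.3f}")
print(f"Adjusted R^2 (n=100)   = {adj_hotel:.3f}")
print(f"Adjusted R^2 (n=10000) = {adj_large:.3f}")
```

The penalty comes from the ratio $(n-1)/(n-k-1)$, which is about 1.05 for the hotel data but very close to 1 when $n$ is large relative to $k$.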
10.7 The ANOVA Table and $F$-test
The next step in evaluating a regression equation is to examine the $F$-test.
$F$-test for Regression Models
The $F$-test is an evaluation of the entire model, and tests whether $R^2$ is significantly different from zero. When $R^2$ is significantly different from zero, you can say that the independent variable(s) explain a significant portion of the variance in the dependent variable. You can also say that it is a test of how well the model fits the data. The rule of thumb is to use a significance level of 5% ($\alpha = 0.05$). The general equation for $F$ is:
$$F = \frac{\frac{RSS}{k}}{\frac{ESS}{(n-k-1)}}$$
This is the hypothesis test:
$\mathrm{H}_0: \beta_1 = \beta_2 = \ldots = \beta_k = 0$
$\mathrm{H}_1$: At least one $\beta_i \neq 0$
The ANOVA output from the Hotel data with five independent variables is shown in Table 10.7.
Substituting the values, we get the equation:
$$F = \frac{\frac{RSS}{k}}{\frac{ESS}{(n-k-1)}} = \frac{\frac{17{,}339.917}{5}}{\frac{23{,}413.073}{(100-5-1)}} = 13.92$$
The null hypothesis is rejected if the test-statistic for $F$ is larger than the critical cutoff value from the $F$-tables. The cutoff value is written $F_{\alpha, k, n-k-1}$, where $\alpha = 5\%$ is the significance level we choose, $k = 5$ is the number of independent variables (the numerator degrees of freedom), and $n-k-1 = 100-5-1 = 94$ is the denominator degrees of freedom, with $n = 100$. These are the same values that appear in the $F$ equation above and in the df column of Table 10.7. Then, refer to the $F$-table for $\alpha$ at 5%. The critical cutoff value for $F_{5\%, 5, 94}$ is approximately 2.3. Since 13.92 is far larger than this cutoff, we reject the null hypothesis. The $R^2$ of 42.5% is significantly different from zero.
Of course, with the advent of computers and their ability to calculate the significance probability (p-value), the tables are somewhat redundant. Since Sig. = 0.000 (that is, the p-value is below 0.0005) and the cutoff is 0.05 (5% significance level), you simply see that $0.000 < 0.05$, and conclude that the $F$-statistic and thus $R^2$ are significantly different from zero.
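The $F$-statistic can be reproduced from the sums of squares, or equivalently from $R^2$ alone. The following sketch assumes the hotel values; the two helper functions are our own names, and the second formula is the standard algebraic rearrangement obtained by dividing numerator and denominator by TSS.

```python
def f_from_sums_of_squares(rss: float, ess: float, n: int, k: int) -> float:
    """F = (RSS / k) / (ESS / (n - k - 1))."""
    return (rss / k) / (ess / (n - k - 1))


def f_from_r2(r2: float, n: int, k: int) -> float:
    """Same statistic written in terms of R-square."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


RSS = 17_339.917
ESS = 23_413.073
n, k = 100, 5

f_stat = f_from_sums_of_squares(RSS, ESS, n, k)
f_check = f_from_r2(RSS / (RSS + ESS), n, k)

print(f"Numerator df = {k}, denominator df = {n - k - 1}")
print(f"F from sums of squares = {f_stat:.3f}")
print(f"F from R^2             = {f_check:.3f}")
```

The second form makes explicit why the $F$-test is described as a test of whether $R^2$ differs from zero: for fixed $n$ and $k$, $F$ depends on the data only through $R^2$.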
Table 10.7 Hotel multiple regression ANOVA

ANOVA$^{\mathrm{a}}$

| Model | | Sum of squares | df | Mean square | F | Sig. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | RSS regression | 17,339.917 | 5 | 3467.983 | 13.923 | $.000^{\mathrm{b}}$ |
| | ESS residual | 23,413.073 | 94 | 249.075 | | |
| | TSS Total | 40,752.990 | 99 | | | |

The next step is to test whether each independent variable has a significant effect on the dependent variable. It is the same procedure as for simple regression. In our example, we want to test whether the occupancy rate, price, advertising, competition, and service have significant effects on hotel revenue. For the $t$-tables, we use the formula $n-k-1$ for the degrees of freedom, where $n$ is the sample size and $k$ is the number of independent variables. This is the theoretical model:
$$\text{Revenue}_i = \beta_0 + \beta_1\,\text{Occupancy}_i + \beta_2\,\text{Price}_i + \beta_3\,\text{Advertising}_i + \beta_4\,\text{Competition}_i + \beta_5\,\text{Service}_i + \varepsilon_i$$
These are the hypotheses for each independent variable:
$\mathrm{H}_0: \beta_1 = 0$ and $\mathrm{H}_1: \beta_1 > 0$.
$\mathrm{H}_0: \beta_2 = 0$ and $\mathrm{H}_2: \beta_2 < 0$.
$\mathrm{H}_0: \beta_3 = 0$ and $\mathrm{H}_3: \beta_3 > 0$.
$\mathrm{H}_0: \beta_4 = 0$ and $\mathrm{H}_4: \beta_4 < 0$.
$\mathrm{H}_0: \beta_5 = 0$ and $\mathrm{H}_5: \beta_5 > 0$.
Note that all the hypotheses are one-sided. We choose the rule-of-thumb 5% significance level. Table 10.8 shows the SPSS regression output for the coefficients. SPSS, like most statistical programs, shows p-values (Sig.) for a two-sided test. To get the p-value for a one-sided test, divide the Sig. value by 2, provided the estimated coefficient has the hypothesized sign.
Table 10.8 Hotel multiple regression coefficients
For example, for occupancy, the estimate of the beta coefficient is $\widehat{\beta}_1 = 0.358$, with standard error $s_{\widehat{\beta}_1} = 0.116$. This gives a $t$-value of:
$$t = \frac{\widehat{\beta}_1}{s_{\widehat{\beta}_1}} = \frac{0.358}{0.116} = 3.08$$
The critical cutoff value from the $t$-tables for $100-5-1 = 94$ degrees of freedom, at a significance level of 5%, for a one-sided test, is 1.660. Since $3.08 > 1.660$, we reject the null hypothesis. We conclude that occupancy has a significant positive effect on revenue. The critical cutoff value is the same for all the coefficients because they are all one-sided tests with the same degrees of freedom. We can save time looking in the $t$-tables by just making a table of the p-values for a one-sided test (see Table 10.9).
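The coefficient test and the conversion from a two-sided to a one-sided p-value can be sketched as follows. The occupancy estimate and standard error are the values quoted in the text; the two-sided p-value of 0.003 is a hypothetical stand-in for the Sig. column, and both function names are ours.

```python
def t_statistic(beta_hat: float, std_error: float) -> float:
    """Estimated coefficient divided by its standard error."""
    return beta_hat / std_error


def one_sided_p(two_sided_p: float, t_stat: float, alternative: str) -> float:
    """Convert a two-sided p-value into a one-sided p-value.

    alternative is "greater" (H1: beta > 0) or "less" (H1: beta < 0).
    Halving is only valid when the estimate has the hypothesized sign;
    otherwise the one-sided p-value is 1 minus half the two-sided value.
    """
    if alternative not in ("greater", "less"):
        raise ValueError(f"unknown alternative: {alternative}")
    sign_matches = t_stat > 0 if alternative == "greater" else t_stat < 0
    half = two_sided_p / 2
    return half if sign_matches else 1 - half


# Occupancy coefficient from the hotel regression.
t_occupancy = t_statistic(0.358, 0.116)
critical = 1.660            # one-sided 5% cutoff for 94 degrees of freedom
sig_two_sided = 0.003       # hypothetical Sig. value, for illustration only

p_one_sided = one_sided_p(sig_two_sided, t_occupancy, "greater")
print(f"t = {t_occupancy:.3f}, reject H0: {t_occupancy > critical}")
print(f"one-sided p = {p_one_sided:.4f}")
```

The sign check matters for variables such as price and competition, where the hypothesis is negative: a positive estimate with a small two-sided p-value would not support the hypothesis.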
Standardized Beta Coefficients
In Table 10.8, there is a column called standardized beta coefficients. It shows standardized effects of all the beta coefficients so that you can compare them for their relative effect on the dependent variable. They usually have values between $-1$ and $+1$, and you compare their absolute values for relative effect size. The standardized beta is highest for price ($-0.519$), second highest for occupancy (0.245), and lowest of the three significant variables for advertising (0.177). Insignificant beta coefficients should not be interpreted, so we do not discuss their standardized beta coefficients.
The practical conclusions are that a one-standard-deviation change in price will have the greatest impact on revenue, followed by occupancy and advertising. It would be up to management to decide which of the variables are most actionable. Perhaps they do not want to start a price war with the competition, and they cannot directly affect occupancy, so they opt to increase advertising to increase revenue. Then, the question would be whether the added costs of advertising are offset by the additional revenue.
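To make the comparison concrete, the sketch below ranks the three significant standardized coefficients by absolute value and shows the usual conversion from an unstandardized coefficient (multiply by the standard deviation of $X$ and divide by the standard deviation of $Y$). The numbers passed to `standardize` are hypothetical, since the chapter does not report the standard deviations.

```python
def standardize(b: float, sd_x: float, sd_y: float) -> float:
    """Standardized beta: change in Y (in SDs) per one-SD change in X."""
    return b * sd_x / sd_y


# Standardized betas of the significant variables, as quoted in the text.
standardized_betas = {
    "price": -0.519,
    "occupancy": 0.245,
    "advertising": 0.177,
}

# Rank by absolute value: the sign shows direction, not strength.
ranked = sorted(
    standardized_betas,
    key=lambda name: abs(standardized_betas[name]),
    reverse=True,
)
print("Relative effect size, largest first:", ranked)

# Hypothetical illustration of the conversion (not hotel data).
example = standardize(b=2.0, sd_x=3.0, sd_y=12.0)
print(f"Standardized beta for the hypothetical case: {example:.2f}")
```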
Regression Partial Plots
The partial plots make it possible to view the contribution of each independent variable to the dependent variable. Interpretation of the plots and the regression coefficients must be considered within the context of the entire model. Partial plots provide a good way of evaluating the linearity of the relationship between each specific independent variable and the dependent variable. They are also good for identifying multivariate outliers. For example, we highlight one data point on all three plots that may be considered an outlier (observation 83). To generate partial plots, with the linear regression dialogue box open, choose Plots, and then tick Produce all partial plots. Click OK (see Fig. 10.13). Note that we also requested a Histogram and Normal probability plot (Standardized Residual Plots). These are necessary for later residual analysis.
Fig. 10.13 Partial regression plot for occupancy on revenue
Figure 10.14 shows the partial plot for Price on Revenue. As expected, the slope is negative, meaning that higher prices lead to lower revenue. Although it is beyond the scope of our discussion, it is worth noting that if the price were very low, raising it might increase revenue. This is related to the theory of price elasticity. Nevertheless, within the range of prices we are testing, the relationship is linear and negative.
The coefficients table shows, and the graph notes, that the slope of the line is $-7.05$. The $R^2$ for the partial regression is 28%, indicating that price accounts for a large portion of the explained variance. This is also observed in the standardized beta coefficients, where price clearly has the largest absolute effect ($-0.519$).
The occupancy partial plot in Fig. 10.15 has a regression line slope of 0.36 and explained variance of 9.2%. Observation 83 appears to be an outlier.
The advertising partial plot in Fig. 10.16 also shows observation 83 as a possible outlier. Though we will not do it in this analysis, the next step would be to remove observation 83 to see whether the results change in a substantive way. If they do, observation 83 should be left out. One observation should not determine the results of the regression. If there is no substantive change, then it could be kept.
Fig. 10.14 Partial regression plot for Price on Revenue
Fig. 10.15 Partial regression plot for Occupancy on Revenue
10.8 Too Many or Too Few Independent Variables
Choosing how many independent variables to include in a regression is a trade-off between parsimony and explanatory power. In our current example with the Hotel data, competition and service are not significantly related to revenue, so they can be removed. On the other hand, there are other variables in the dataset that should perhaps be included.
Fig. 10.16 Partial regression plot for Advertising on Revenue
Specification error refers to including irrelevant variables, not including important variables, or choosing the wrong functional form. Functional form refers to linear and nonlinear models, which we discuss briefly later. In this section, we restrict the discussion to the consequences of too many or too few independent variables. Having too many independent variables is a luxury problem: you can always remove them. Missing important explanatory variables is a more serious problem. Choosing independent variables is challenging when:
You do not know which variables are relevant.
You have limited access to variables.
You are using proxy variables (substitute variables) and you do not know how well they measure what you intend.
Consequence of a Too Long Model
Assume that $Y_i = \beta_0 + \beta_1 X_1 + \varepsilon_i$ is the true (optimal) regression model. Then, add an additional independent variable to get the model: $Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon_i$. If $X_1$ and $X_2$ are correlated with each other, part of the variation in $X_1$ is shared with $X_2$, which leaves less independent variation for estimating the effect of $X_1$. The consequence is reduced precision (a larger standard error) in the estimate of $\beta_1$. The same applies to all the beta coefficients in a multiple regression: a too long model reduces the precision of the beta coefficients. In general, insignificant or meaningless independent variables should not be kept in a regression. However, sometimes the researcher wants to make the point that a particular variable is insignificant. Imagine a study on the gender pay gap. Finding that gender is not significant would be very meaningful for the results. Keeping it in the regression may reduce the precision of the other beta coefficients. However, the value of demonstrating that it is insignificant would outweigh the loss in precision.
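The size of the precision loss can be illustrated with a standard result for a model with two independent variables, which is not stated in this chapter: relative to the case where $X_1$ and $X_2$ are uncorrelated, the sampling variance of $\widehat{\beta}_1$ is multiplied by $1/(1-r_{12}^2)$, where $r_{12}$ is the correlation between $X_1$ and $X_2$. The sketch below simply tabulates this factor for a few hypothetical correlations.

```python
import math


def variance_inflation(r12: float) -> float:
    """Factor by which Var(beta_1 hat) grows when X1 and X2 correlate.

    Standard two-regressor result: 1 / (1 - r12 ** 2).
    """
    if not -1 < r12 < 1:
        raise ValueError("correlation must be strictly between -1 and 1")
    return 1 / (1 - r12 ** 2)


# Hypothetical correlations between X1 and an added variable X2.
for r12 in (0.0, 0.3, 0.5, 0.9):
    factor = variance_inflation(r12)
    # The standard error grows by the square root of the variance factor.
    print(f"r12 = {r12:.1f}: variance x {factor:.2f}, "
          f"standard error x {math.sqrt(factor):.2f}")
```

With a weak correlation the cost of an extra variable is small, which is why keeping a theoretically interesting but insignificant variable can be a reasonable choice.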
Table 10.10 Model summary with number of employees

| Model | R | R square | Adjusted R square | Std. error of the estimate |
| :--- | :---: | :---: | :---: | :---: |
| 1 | $0.95^{\mathrm{a}}$ | 0.874 | 0.869 | 0.9353 |
Now consider the opposite case. Assume that $Y_{i}=\beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\varepsilon_{i}$ is the true regression model, and that $X_{2}$ is dropped from the equation. The resulting model is $Y_{i}=\beta_{0}+\beta_{1} X_{1}+\varepsilon_{i}$. If $X_{1}$ and $X_{2}$ are correlated, then the estimated beta coefficient $\widehat{\beta}_{1}$ will be systematically biased. This is because the parameter estimates are a function of the correlation between the included and excluded independent variables. The estimated error variance will also be misspecified, which leads to flawed confidence intervals. In sum, missing important independent variables causes a systematic bias in the parameter estimates, which is more serious than just losing precision.
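The bias from a too short model can also be shown with a simulation. This is a minimal sketch on synthetic data; the coefficients 2.0 and 3.0 and the link between `x1` and `x2` are assumptions chosen for illustration. Because `x2` depends on `x1` with a slope of 0.6, dropping `x2` pushes the estimated coefficient on `x1` toward 2.0 + 3.0 × 0.6 = 3.8 instead of the true 2.0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)              # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)    # true model uses both

const = np.ones(n)
# Correctly specified model: constant, x1, x2
full, *_ = np.linalg.lstsq(np.column_stack([const, x1, x2]), y, rcond=None)
# Too short model: x2 is omitted and ends up in the error term
short, *_ = np.linalg.lstsq(np.column_stack([const, x1]), y, rcond=None)

print("beta_1 with x2 included:", full[1])
print("beta_1 with x2 omitted: ", short[1])
```

A larger sample does not remove this bias. It only makes the wrong value more precisely estimated.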
To illustrate a too short model with our current hotel example, add number of employees to the independent variables, and remove competition and service. The results are shown in Tables 10.10, 10.11, and 10.12.
Note the dramatic increase in explained variance ($R_{adj}^{2}$), from 39.5% (see Table 10.5) to 86.9% in Table 10.10. Clearly, number of employees is an extremely important variable for explaining revenue.
In Table 10.11, the significance probability (p-value) is below 0.05, indicating that the $R^{2}$ is significantly different from zero. The equation fits the data well.
In Table 10.12, we see that the variables price and number of employees significantly influence revenue, whereas occupancy and advertising have become insignificant. This means that the influence of the number of employees variable is so strong that the effects of occupancy and advertising disappear. We re-estimate the model without occupancy and advertising.
The $R^{2}$ for the final model is 87.3%, with a corresponding $F$-statistic of 333.062 (df 2, 97) and a p-value of 0.000. We conclude that the model fits the data well. The coefficients table is shown in Table 10.13.
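The link between $R^{2}$ and the $F$-statistic can be checked with the standard formula $F = (R^{2}/k) / ((1-R^{2})/df_{resid})$, where $k$ is the number of independent variables. The sketch below is illustrative only: the function name `f_from_r2` is our own, and because it starts from the rounded $R^{2}$ of 0.873, it gives about 333.4 rather than the reported 333.062.

```python
def f_from_r2(r2, k, df_resid):
    """F-statistic implied by R-squared, k predictors, and residual degrees of freedom."""
    explained_per_df = r2 / k
    unexplained_per_df = (1.0 - r2) / df_resid
    return explained_per_df / unexplained_per_df

# Final hotel model as reported in the text: R^2 = 0.873, df = (2, 97)
f_final = f_from_r2(0.873, k=2, df_resid=97)
print(round(f_final, 1))
```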
According to the unstandardized betas, price has a negative effect on revenue, whereas the number of employees has a positive effect. The standardized betas show that, by absolute value, the number of employees has a much larger effect ($0.846>|-0.160|$). A one-unit increase in price will reduce revenue by 2.172 units. A one-unit increase in employees is associated with a 0.563-unit increase in revenue. From a pragmatic perspective, increasing price means fewer guests come and revenue drops. Increasing the number of employees probably improves service in some way that attracts more guests, so revenue increases. These effects apply only within the observed range of the data. At more extreme levels, the effects probably become nonlinear. For example, radically increasing the number of employees will raise costs to a point where revenue falls.
Table 10.12 Coefficients with number of employees
$^{\mathrm{a}}$ Dependent variable: Revenue
10.9 Regression with Dummy Variables
Several times throughout the book we have discussed nominal level variables, that is, categorical variables with no order. Dummy variables are categorical binary variables usually coded with the values 0 and 1. In ordinary least squares (OLS) regression, they can be included as independent variables. They cannot be a dependent variable. There are special forms of regression for handling categorical dependent variables, for example, logistic regression. However, this is beyond the scope of this textbook.
A requirement for testing group differences with dummy variables in OLS regression is that there must be $k-1$ dummy variables for $k$ categories. In the Hotel data, there is a Segment variable that has three categories: vacation, conference, and business. In the next section, we will demonstrate how to recode variables to form dummy variables. Then, we will show how to include them in OLS regression.
Fig. 10.17 Recode into different variable
Forming Dummy Variables
In this example, we will investigate whether there is a difference in the dependent variable, Revenue, for the three hotel segments: vacation, conference, and business. We will also include the independent variable Occupancy. Following the $k-1$ dummy rule, we need to form two dummy variables. In the Hotel dataset, we have already created them for you, called Vacation and Conference. Nevertheless, we will show you how to create them.
In SPSS, choose Transform > Recode into Different Variables. Move the Segment variable into the Numeric Variable position. Name the Output Variable (we chose Vacation), and move it to the Output Variable position (see Fig. 10.17).
Click on the Old and New Values button to open the Recode dialogue box. In the Recode dialogue box, insert 1 as the Old Value and 1 as the New Value, and click Add. Continue by putting 2 in the Old Value and 0 in the New Value, then click Add. Finally, put 3 in the Old Value and 0 in the New Value, and click Add (see Fig. 10.18). This means that all values coded 1 in the Segment variable (vacation) will be coded 1, while the other two categories will be coded 0. Click Continue and OK to create the new dummy variable, Vacation.
Repeat the process to create the second dummy variable for Conference. However, this time category 2, which represents conference hotels, is coded 1, and all other categories are coded 0 (see Fig. 10.19).
As a result, there will be two dummy variables. The Vacation dummy variable is coded 1 for each vacation hotel and 0 for all other hotels. The Conference dummy variable is coded 1 for each conference hotel and 0 for all other hotels. Business hotels are represented in the two dummy variables as the category that is always coded 0.
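The same recoding can be expressed in a few lines of code. This is a minimal sketch, not part of the SPSS workflow: the helper `make_dummy` and the six example codes are hypothetical, and the coding 1 = vacation, 2 = conference, 3 = business follows the description above.

```python
def make_dummy(codes, target):
    """Return a 0/1 list that is 1 where the category code equals target."""
    return [1 if code == target else 0 for code in codes]

# Hypothetical Segment codes for six hotels: 1 = vacation, 2 = conference, 3 = business
segment = [1, 2, 3, 1, 3, 2]

vacation = make_dummy(segment, target=1)
conference = make_dummy(segment, target=2)

# Business hotels (code 3) are the baseline: both dummies are 0 for them
for code, v, c in zip(segment, vacation, conference):
    print(code, v, c)
```

Only two dummies are created for the three categories, in line with the $k-1$ rule. A third dummy for business would be a perfect linear function of the other two.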
Fig. 10.18 Recode into different variables: old and new values
Fig. 10.19 Recode into different variables: old and new values
Figure 10.20 shows the regression dialogue box with Revenue as the dependent variable, the two dummy variables Vacation and Conference, and the continuous independent variable Occupancy. The dummy variables are simply treated as additional independent variables. In plain terms, we are testing for segment group differences while controlling for the level of occupancy.
The $R^{2}$ for the equation is 26.6%, with an $F$-statistic of 11.585 on 3 and 96 degrees of freedom. The p-value for the $F$-statistic is 0.000, indicating that the $R^{2}$ is significantly greater than zero. Table 10.14 shows the output for the regression coefficients. Applying the rule of thumb cutoff of 0.05, the p-values (Sig.) indicate that all independent variables have a statistically significant relationship with the dependent variable. Vacation has a p-value of 0.000, Conference has 0.001, and Occupancy has 0.043. The easiest way to explain their interpretation is to show a graph.
Figure 10.21 shows the positive slope of the Occupancy regression line (0.292), intersecting the $Y$-axis at 17.74 (rounded off). This is the regression line that represents business hotels, because business is the category for which both the Vacation and Conference dummy variables are zero. To understand the effect of the dummy variables, start by adding the unstandardized beta coefficient for Conference (-16.55) to the constant (17.74) to get the $Y$-intercept of 1.19. Then, add the unstandardized beta coefficient for Vacation (-20.369) to the constant (17.74) to get the $Y$-intercept of -2.63.
Fig. 10.20 Regression dialogue box
Table 10.14 Regression coefficients with dummy variables
$^{\mathrm{a}}$ Dependent variable: Revenue
As the graph in Fig. 10.21 indicates, dummy variables shift the regression line up or down from the baseline category, which in this case is business hotels. Business hotels generate more revenue at each level of occupancy than conference hotels, followed by vacation hotels. This example shows the danger of interpreting the intercept. Presumably, none of the hotel types would generate revenue with zero occupancy. Nevertheless, the graph shows the degree by which revenue increases as occupancy increases for each segment.
To summarize, the continuous independent variable Occupancy forms a regression line with a $Y$-intercept at the constant. The dummy variables shift the line up or down by the magnitude of their respective unstandardized beta coefficients.
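The parallel shift can be reproduced directly from the coefficients quoted in the text (constant 17.74, Occupancy 0.292, Conference -16.55, Vacation -20.369). The function below is an illustrative sketch. Its name and structure are our own, and the occupancy value of 50 is a hypothetical input.

```python
CONSTANT = 17.74        # intercept for the baseline category (business hotels)
B_OCCUPANCY = 0.292     # common slope for all three segments
B_CONFERENCE = -16.55   # shift for conference hotels
B_VACATION = -20.369    # shift for vacation hotels

def predicted_revenue(occupancy, vacation=0, conference=0):
    """Predicted revenue from the dummy regression; dummies are 0 or 1."""
    return (CONSTANT
            + B_OCCUPANCY * occupancy
            + B_VACATION * vacation
            + B_CONFERENCE * conference)

# Y-intercepts for each segment (occupancy = 0)
intercepts = {
    "business": predicted_revenue(0),
    "conference": predicted_revenue(0, conference=1),
    "vacation": predicted_revenue(0, vacation=1),
}
print(intercepts)

# The gap between segments is the same at any occupancy level
gap_at_50 = predicted_revenue(50) - predicted_revenue(50, conference=1)
print(gap_at_50)
```

Because the dummies enter additively, the three lines share one slope and differ only in their intercepts.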
10.10 Dummy Regression: An Alternative Analysis of Covariance ANCOVA
In Chap. 9, we explained the independent samples $t$-test as a way of testing mean differences between two independent samples. The independent samples $t$-test is directly related to the $t$-test of the regression coefficients. In Chap. 9, we tested for mean differences of Revenue for Hotel type: boutique coded 0 and chain coded 1. The mean revenue for each hotel type is shown in Table 10.15.
The difference is statistically significant with a $t$-statistic of -2.276 and a corresponding p-value of 0.025. If we instead run an OLS regression with Revenue as the dependent variable and Hotel type as the independent variable, we get the coefficients table shown in Table 10.16.
Note that the absolute value of the $t$-statistic is the same (2.276), as is the p-value (0.025). The conclusion from the OLS regression based on the test of the dummy variable Hotel type is the same: there is a significant difference between group means for the dependent variable, Revenue. The mean revenue for the category coded 0 (boutique) is equal to the constant (15.450). The mean revenue for the category coded 1 (chain) is the constant plus the unstandardized beta coefficient:
$15.450+9.233=24.683$
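The relationship between the regression coefficients and the group means holds for any data set with a single 0/1 dummy. The sketch below demonstrates it on synthetic revenue figures. The group sizes, means, and spread are assumptions for illustration and are not taken from the Hotel data.

```python
import numpy as np

rng = np.random.default_rng(2)
boutique = rng.normal(15, 5, size=40)   # synthetic revenue, group coded 0
chain = rng.normal(25, 5, size=60)      # synthetic revenue, group coded 1

y = np.concatenate([boutique, chain])
dummy = np.concatenate([np.zeros(40), np.ones(60)])

# OLS with a constant and the dummy as the only independent variable
X = np.column_stack([np.ones_like(dummy), dummy])
coef = np.linalg.lstsq(X, y, rcond=None)[0]
b0, b1 = coef

print("constant:        ", b0, " mean of group 0:", boutique.mean())
print("constant + beta: ", b0 + b1, " mean of group 1:", chain.mean())
```

The constant reproduces the mean of the group coded 0, and the dummy coefficient is the difference between the two group means.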
When simply testing for group differences on a continuous dependent variable, it is simpler to use an independent samples $t$-test, or if there are more than two groups, a one-way ANOVA with post hoc tests (also explained in Chap. 9). However, for ANCOVA (analysis of covariance) we suggest using OLS regression.
Table 10.15 Mean revenue differences for hotel type
ANCOVA is a combination of ANOVA and regression, falling under the heading of general linear models. In SPSS, under Analyze, you will find the General Linear Model category, with several subcategories. In simple terms, you test for group mean differences in a continuous dependent variable, while at the same time accounting for covariance with additional continuous independent variables. We just did this in the regression with Revenue as the dependent variable, Occupancy as a continuous independent variable, and the two dummy variables Conference and Vacation to test for group mean differences across the three segments of vacation, conference, and business hotels.
Table 10.17 shows the group means for Revenue by Segment.
Table 10.18 shows the coefficients for an OLS regression with Revenue as the dependent variable, and Vacation and Conference as the independent variables. The constant (36.654) is the mean for the group always coded 0, which in this case is business hotels (36.65). To get the means for the other two categories, simply add their respective unstandardized coefficients to the constant.
The $t$-statistic and corresponding p-value indicate whether there is a significant difference between the category represented by the dummy variable and the category represented by the constant. With a p-value of 0.000, the mean for Vacation is significantly different from the mean for Business. With a p-value of 0.000, the mean for Conference is significantly different from the mean for Business.
Table 10.17 Means for revenue by segment
$^{\mathrm{a}}$ Dependent variable: Revenue
Since there are no post hoc tests in OLS regression, the only way to test for a significant difference between vacation and conference is to drop one of them from the regression and replace it with the third category dummy, which in this case is business. Table 10.19 shows the OLS regression coefficients.
The p-value of 0.099 for conference indicates that there is no statistically significant difference between mean revenue for vacation and conference hotels.
The Advantage
While there is the hassle of testing the final category for significant mean differences, there is an advantage with respect to analysis of covariance (ANCOVA). In our example including Occupancy as an additional independent variable, we are testing the effect of occupancy on revenue while controlling for segment through the two dummy variables. We can include several independent variables, both continuous variables and dichotomous dummies.
10.11 The Classic Assumptions of Multiple Regression
There are seven assumptions upon which OLS regression is based. OLS regression is quite robust against deviations from the assumptions. Nevertheless, they should be considered. In practice, we address some of the assumptions as we prepare the data for regression analysis. Then, when we do the regression analysis, we follow it up with a residual analysis. Often, it is an iterative process until we have a regression model that satisfies the assumptions and represents the data.
Table 10.19 Regression coefficients replacing vacation with business
$^{\mathrm{a}}$ Dependent variable: Revenue
For testing the assumptions, we use the final Hotel data regression model with revenue as the dependent variable, and price and number of employees as the independent variables.
(1) The Regression Model Is Linear in the Coefficients, Is Correctly Specified, and Has an Additive Error Term
So long as we have done a thorough job of specifying the model, based on theory and logic, we assume this assumption to be satisfied. There are a few important things to know. The equations do not have to be linear in the variables, only in the parameters. This allows nonlinear terms, such as a squared variable, to enter the model as independent variables. We discussed the problems associated with too short and too long models. We specify the best model we can under the circumstances we are presented with. Finally, all regressions must include an error term. So long as you do not specify anything in the software other than to include the error term (it is usually the default), this assumption is satisfied.
We can get an indication of whether there may be problems with this assumption by examining the partial plots and a scatterplot of the standardized predicted values ($X$-axis) against the standardized residuals ($Y$-axis). For the partial plots, choose Analyze > Regression > Linear. Put revenue in the dependent box and choose price and number of employees for the independent variables. Open Plots, and choose Produce all partial plots. For use in subsequent analysis, also choose Histogram and Normal probability plot (refer back to Fig. 10.13). Click Continue. Then, click Save, and choose Standardized predicted values and Standardized residuals (see Fig. 10.22). Click Continue and OK.
The partial plots are shown in Figs. 10.23 and 10.24. We have taken the liberty of identifying potential outliers. We will not do anything about outliers in the current example. However, we should test the results without the outliers to see if the estimates substantively change. Aguinis et al. (2013) offer an excellent discussion of how to address outliers.
Despite the outliers, the number of employees has a strong linear relationship with revenue. We do not see any unusual patterns or a tendency toward being curved.
Fig. 10.22 Regression save
Fig. 10.23 Partial plot for number of employees
Price has less of a clear slope. However, it appears linear, especially once outliers are removed.
To produce the residual plot, choose Graphs > Chart Builder > Scatter/Dot. Drag and drop a simple scatter into the Gallery. Drag the new variables we saved to their appropriate axes. Click OK (see Fig. 10.25).
The output is shown in Fig. 10.26. Again, we have taken the liberty of highlighting outliers. The data, in spite of the outliers, is somewhat clumped together. However, it appears random and there is no nonlinear pattern. Our conclusion is that assumption 1 is met.
(2) The Error Term Has a Zero Population Mean
This assumption cannot be tested. It is assumed to be satisfied so long as the constant ($\beta_{0}$) has been included in the equation.
(3) The Independent Variables Are Uncorrelated with the Error Term
This assumption is violated, for example, when something that is unaccounted for in the independent variables, meaning it is in the error term, is correlated with one or more of the independent variables. In recent years, this has been referred to as the endogeneity problem. In our regression example, we modeled price and number of employees as independent variables and revenue as the dependent variable. We have assumed a linear relationship. What if, as might easily be the case, the hotels raise their prices and number of employees in high season?
Fig. 10.24 Partial plot for price
Season affects revenue, price, and staffing. If the seasonal changes are substantial to the extent that they would be statistically significant, then because they are in the error term, the independent variables are correlated with the error term. It also means that the unaccounted effect simultaneously influences the dependent and independent variables. This assumption highlights the importance of proper model specification.
(4) Uncorrelated Error Terms (No Serial Correlation)
This assumption is often violated in time series analysis, where the residuals from data collected at one time are correlated with the residuals from data collected at another time. To test this assumption, plot any sequencing variables, like time, against the standardized residuals. If patterns emerge, then the assumption is violated. Unless there is a sequencing variable in the data, this assumption is usually satisfied.
(5) The Residuals Have a Constant Variance (No Heteroscedasticity)
To test for heteroscedasticity, plot the standardized residuals against the standardized predicted values (refer back to Fig. 10.26). If there are no distinct patterns and the data is randomly distributed, this assumption is satisfied.
Fig. 10.25 Plotting a residual analysis
Fig. 10.26 Residual plot
Fig. 10.27 Normal P-P plot of the regression standardized residuals (dependent variable: Revenue)
(6) The Residuals Are Normally Distributed
When we estimated the regression, we suggested requesting a histogram and a normal probability plot. The normal P-P plot, for a normal distribution, should show the data tightly clumped along the line. In Fig. 10.27, the data shows a slight S-curve along the line.
The histogram in Fig. 10.28 appears peaked and drawn out to the left.
Another way to test for normality is to look at the skewness and kurtosis statistics for the standardized residuals. Skewness is how much the data is skewed in either a positive or negative direction. Kurtosis is a measure of how peaked or flat the distribution is. Most computer software reports these measures so that a normal distribution has values of zero, and values between -1 and 1 are treated as acceptably normal. Another way to put it is that skewness and kurtosis should be below an absolute value of 1.
We saved the standardized residuals when we estimated the regression (see Fig. 10.22). Choose Analyze > Descriptive Statistics > Descriptives. Put the standardized residuals into the Variable(s) box. Choose Options and untick everything, then tick Skewness and Kurtosis. Click Continue and OK (see Fig. 10.29).
Table 10.20 shows the results. The skewness of -1.297 suggests that the distribution is slightly skewed to the left, which matches the histogram in Fig. 10.28. Given the robustness of OLS regression analysis, this skewness, which is slightly above the absolute value of 1, would not overly concern us. The kurtosis is 5.878, which is far above the recommended cutoff of 1. This leads us to question the validity of the regression analysis.
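For readers who want to see what these two statistics measure, the following sketch computes simple moment-based versions. It rests on two assumptions: the function name and the example values are our own, and SPSS applies small-sample corrections, so its figures will differ somewhat from these. A negative skewness indicates a longer tail to the left, and an excess kurtosis above zero indicates a more peaked, heavier-tailed distribution than the normal.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (normal distribution = 0 for both)."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()          # standardize with the population SD
    skewness = np.mean(z ** 3)
    excess_kurtosis = np.mean(z ** 4) - 3.0
    return skewness, excess_kurtosis

# Hypothetical residuals with one large negative outlier (long left tail)
left_tailed = [-6.0, -1.0, -0.5, 0.0, 0.0, 0.2, 0.4, 0.5, 0.7, 1.0]
skew, kurt = skew_and_excess_kurtosis(left_tailed)
print("skewness:", skew, "excess kurtosis:", kurt)
```

A single extreme residual is enough to push both statistics well outside the range of -1 to 1, which is why outliers deserve attention before normality is judged.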
For the sake of simplicity in explaining the regression example, we have not removed the outliers. In a normal analysis, we would have tested the regression without the outliers. In fact, by removing four outliers the regression performed much better with respect to the assumptions. Interestingly, it did not change the interpretation. This points to the fact that regression is quite robust even when the assumptions are violated.
Fig. 10.28 Residual histogram
(7) No Multicollinearity
No independent variable should be a perfect linear function of another independent variable. That is, there is no perfect multicollinearity. If two independent variables are perfectly correlated, then in a regression, they explain the same thing. There is no point in having both variables in the equation. The generally accepted cutoff level indicating a problem with multicollinearity is a correlation coefficient higher than an absolute value of 0.9. The first test of multicollinearity is to run a correlation matrix between the independent variables. We show the correlation matrix for price and number of employees in Table 10.21. At 0.489, the correlation is well below the accepted cutoff.
If the correlation coefficient were high enough to raise our concerns, then in the regression analysis we could ask for collinearity diagnostics. Choose Statistics, and tick Collinearity diagnostics (see Fig. 10.30).
They show up to the right of the coefficients table (see Table 10.22).
The generally accepted cutoff indicating a problem with multicollinearity is a VIF number above 10 (Hair, Black, Babin & Anderson, 2014). VIF stands for variance inflation factor. At 1.314, the VIF is well below the cutoff. We conclude that this assumption is satisfied.
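With only two independent variables, the VIF follows directly from their correlation: $\mathrm{VIF}=1/(1-r^{2})$. The short sketch below (the function name is our own) applies this to the correlation of 0.489 between price and number of employees and reproduces the reported value of 1.314. With more than two independent variables, $r^{2}$ is replaced by the $R^{2}$ from regressing each independent variable on all the others.

```python
def vif_two_predictors(r):
    """Variance inflation factor when there are exactly two independent variables."""
    return 1.0 / (1.0 - r ** 2)

# Correlation between price and number of employees reported in the text
vif = vif_two_predictors(0.489)
print(round(vif, 3))  # 1.314

# For comparison: the correlation cutoff of 0.9 corresponds to a VIF of about 5.3
print(round(vif_two_predictors(0.9), 2))
```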
10.12 Summary
In this chapter, we have described ordinary least squares (OLS) regression. It is a dependency technique where one or more independent variables ($X_{i}$) predict the value of one dependent variable ($Y$). It is one of the most widely used statistical methods for predicting outcomes. The dependent variable must be measured at the interval or ratio level, though it is common to use Likert-type measures and treat them as approximately interval. That is, we treat them as continuous variables. The independent variables can be measured at any level. Ordinal variables should be converted to nominal (dummy) variables, with one fewer dummy variable than the number of ordinal categories. There are seven classic assumptions for multiple regression analysis. While they are important, OLS regression is fairly robust against violations of the assumptions.
References
Aguinis, H., Gottfredson, R. K., & Joo, H. (2013). Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods, 16(2), 270-301.
Hair, J. F., Black, W. C., Babin, B. J., & Anderson, R. E. (2014). Multivariate Data Analysis, 7th edition, International edition. Pearson Education Limited.
$^{\mathrm{a}}$ Correlation is significant at the 0.01 level (2-tailed)