
Customer Churn Analysis

1. Introduction

Customer churn is a major business concern, especially in competitive industries, where retaining existing customers is often more cost-effective than acquiring new ones[1]. Losing customers not only reduces a business's revenue and profit but can also harm the long-term development of its brand[2]. It is therefore important for the sustainable development of enterprises to analyze the key factors behind customer churn and to formulate feasible strategies for reducing it. The objective of this study is to identify the key drivers of customer churn through data analysis and to make targeted business recommendations that help enterprises improve customer retention and overall competitiveness. Customer churn (Exited) is the dependent variable: a value of 1 indicates that the customer churned, and a value of 0 indicates that the customer was retained.

2. Data and Variables

2.1 Data Source

The analysis is based on a publicly available customer churn dataset from the Heywhale platform (accessible at Heywhale Dataset). The dataset comprises 10,000 records and 12 variables covering customer demographics, account details, and churn status. It gives a broad view of customer behavior and characteristics, which makes it well suited to identifying the factors that influence churn and to supporting actionable insights.

2.2 Data Dictionary

| Variable Name | Type | Description |
|---|---|---|
| Exited | Dependent | Whether the customer churned (1 for yes, 0 for no) |
| RowNumber | Index | Unique identifier for each row in the dataset |
| CustomerId | Identifier | Unique identifier for each customer |
| Surname | Categorical | Customer's surname |
| CreditScore | Numerical | Customer's credit score |
| Geography | Categorical | Country of the customer (France, Germany, Spain) |
| Gender | Categorical | Customer's gender (Male, Female) |
| Age | Numerical | Customer's age |
| Tenure | Numerical | Years of service with the company |
| Balance | Numerical | Customer's account balance |
| NumOfProducts | Numerical | Number of products used by the customer |
| HasCrCard | Categorical | Whether the customer has a credit card (1 for yes, 0 for no) |
| IsActiveMember | Categorical | Whether the customer is an active member (1 for yes, 0 for no) |
| EstimatedSalary | Numerical | Customer's estimated annual salary |

3. Data Processing and Transformation

Missing Values: No missing values were identified in the dataset.

Duplicate Values: No duplicate records were found.

Variable Transformation: Categorical variables such as Gender and Geography were converted into factors for analysis.
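The three checks above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code used in the study; the sample rows and the helper names (`count_missing`, `count_duplicates`, `encode_categorical`) are hypothetical, and the integer coding only approximates what a factor conversion does.

```python
from collections import Counter

# Hypothetical sample rows; column names follow the data dictionary.
rows = [
    {"CustomerId": 1, "Geography": "France", "Gender": "Female", "Age": 42, "Exited": 1},
    {"CustomerId": 2, "Geography": "Spain", "Gender": "Female", "Age": 41, "Exited": 0},
    {"CustomerId": 3, "Geography": "Germany", "Gender": "Male", "Age": 39, "Exited": 0},
]

def count_missing(rows):
    """Count None or empty-string values per column."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
    return dict(missing)

def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in seen.values())

def encode_categorical(rows, column):
    """Map each category to an integer code, similar in spirit to a factor."""
    levels = sorted({row[column] for row in rows})
    codes = {level: i for i, level in enumerate(levels)}
    return levels, [codes[row[column]] for row in rows]

missing_counts = count_missing(rows)
duplicate_count = count_duplicates(rows)
geo_levels, geo_codes = encode_categorical(rows, "Geography")
```

An empty `missing_counts` and a `duplicate_count` of zero correspond to the findings reported above.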

4. Initial Statistical Analysis

4.1 Univariate Analysis

Figure 1 shows the distribution of the dependent variable Exited, which represents customer churn. Of the 10,000 customers in the dataset, approximately 80% (blue bar) did not churn (Exited = 0), while approximately 20% (red bar) churned (Exited = 1).

Figure 1. Distribution of Dependent Variable Exited

Figure 2 shows the distributions of four numerical variables: CreditScore (credit score), Age (age), Balance (account balance), and EstimatedSalary (estimated salary). CreditScore is slightly asymmetric, with the bulk of the distribution leaning toward higher scores and most customers between 600 and 750, indicating good credit overall. Age is concentrated at younger ages, with the majority of customers between 30 and 45 and a long tail of older customers that includes outliers aged 70 and above. Balance has an unusual distribution: many customers have a balance of zero, and the remaining balances are spread over a wide range, reflecting large differences in how much customers engage with the company's financial products. EstimatedSalary, by contrast, is approximately uniformly distributed, which indicates that salary levels are spread evenly across customers.

Figure 2. Distribution of Numerical Variables

4.2 Bivariate Analysis

T-test analysis

The t-test is a widely used statistical method for testing whether the means of two groups differ significantly. Based on hypothesis testing, it computes a t statistic that measures the difference between the group means relative to the variation within the groups[3]. Common types include the independent-samples t-test (which compares two independent groups, such as the incomes of men and women) and the paired-samples t-test (which compares the same individuals under two conditions, such as grades before and after an intervention)[4]. The classical t-test assumes that the data are approximately normally distributed and that the variances of the two groups are equal. In business data analysis, t-tests are widely used to compare numerical variables (e.g., age, credit score) across the levels of a categorical variable (e.g., whether a customer churned)[5].
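To make the t statistic concrete, the sketch below computes it for two independent samples. It uses Welch's variant, which does not require equal variances; whether the tests reported here used this variant or the classical pooled-variance form is not stated, so treat that choice as an assumption. The function name `welch_t` and the ages are hypothetical and are not taken from the dataset.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical ages for illustration only.
retained_ages = [34, 36, 38, 40, 37]
churned_ages = [44, 47, 43, 46, 45]

t_stat, dof = welch_t(retained_ages, churned_ages)
```

Because the first sample is the retained group, a negative `t_stat` means the churned group has the higher mean.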

The t-test results show that several variables differ significantly between churned and non-churned customers. First, the mean CreditScore is 651.85 for non-churned customers and 645.35 for churned customers, with a t-value of 2.63 and a p-value of 0.0085, indicating a statistically significant, though small, difference in credit scores between the two groups. The difference in Age is much larger: the mean age is 37.41 for non-churned customers and 44.84 for churned customers (t = -30.42, p < 0.001), which indicates that age is strongly associated with churn and that churned customers are generally older. For Balance, the mean account balance is 91108.54 for churned customers and 72745.30 for non-churned customers (t = -12.47, p < 0.001), indicating that high-balance customers are more likely to churn. This may reflect some dissatisfaction with the service experience among high-value customers. NumOfProducts also differs significantly: non-churned customers hold 1.54 products on average, compared with 1.48 for churned customers (t = 3.70, p < 0.001).

By contrast, the differences in Tenure (how long a customer has been with the company) and EstimatedSalary (estimated salary) are not significant. Both p-values are greater than 0.05 (0.1664 and 0.2289, respectively), which suggests that these variables are not meaningfully related to churn. In summary, Age, CreditScore, Balance, and NumOfProducts distinguish churned from non-churned customers, while Tenure and EstimatedSalary have limited impact on churn.

Table 1. T-Test Results

| Variable | Non-Churned Mean | Churned Mean | t-value | p-value | Confidence Interval |
|---|---|---|---|---|---|
| CreditScore | 651.85 | 645.35 | 2.63 | 0.0085 | [1.66, 11.34] |
| Age | 37.41 | 44.84 | -30.42 | < 0.001 | [-7.91, -6.95] |
| Tenure | 5.03 | 4.93 | 1.38 | 0.1664 | [-0.04, 0.24] |
| Balance | 72745.30 | 91108.54 | -12.47 | < 0.001 | [-21250.22, -15476.26] |
| NumOfProducts | 1.54 | 1.48 | 3.70 | < 0.001 | [0.03, 0.11] |
| EstimatedSalary | 99738.39 | 101465.68 | -1.20 | 0.2289 | [-4541.66, 1087.08] |

Chi-Square analysis

Chi-square analysis is a common non-parametric statistical method for assessing whether two categorical variables are significantly associated[6]. It tests the independence of the variables by comparing the observed counts in each cell of a contingency table with the counts expected if the variables were independent. If the observed differences are much larger than chance alone would produce, the two variables are likely to be associated[8]. The results usually include the chi-square statistic, the degrees of freedom (df), and a p-value that indicates the significance of the relationship. In business data analysis, chi-square analysis is widely used to study relationships between categorical variables[7].
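The comparison of observed and expected counts can be sketched as follows. This is a minimal illustration of Pearson's chi-square statistic without any continuity correction (some statistical packages apply one by default for 2 x 2 tables, so their output can differ); the function name `chi_square` and the counts are hypothetical and are not taken from the dataset.

```python
def chi_square(observed):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

# Hypothetical counts: rows = active / inactive members, columns = retained / churned.
table = [[90, 10],
         [60, 40]]

chi2_stat, chi2_df = chi_square(table)
```

The degrees of freedom follow from the table shape, (rows - 1) x (columns - 1), which is why a three-country variable crossed with a binary outcome has df = 2.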

Table 2 presents the chi-square test results for the association between each categorical variable and churn (Exited). Geography has a chi-square statistic of 301.26 with 2 degrees of freedom and a p-value below 0.001, indicating a significant association between a customer's country and churn, which may reflect the influence of geographical factors on customer behavior. Gender also shows a significant association (chi-square = 112.92, p < 0.001), suggesting that gender plays a role in churn. HasCrCard (whether the customer has a credit card), by contrast, has no significant association with churn (p = 0.4924), so credit card ownership is unlikely to be an important factor in customer retention. Finally, IsActiveMember (whether the customer is an active member) has a p-value below 0.001, indicating a significant association between active membership status and churn. Inactive customers are more likely to churn, suggesting that companies should focus on inactive customers and improve retention by strengthening customer engagement.

In summary, Geography, Gender, and IsActiveMember are the categorical variables significantly associated with customer churn, while HasCrCard is not.

Table 2. Chi-Square Test Results

| Variable | Chi-Square Statistic | Degrees of Freedom (df) | p-value |
|---|---|---|---|
| Geography | 301.26 | 2 | < 0.001 |
| Gender | 112.92 | 1 | < 0.001 |
| HasCrCard | 0.47 | 1 | 0.4924 |
| IsActiveMember | 242.99 | 1 | < 0.001 |

5. Statistical Measures and Visualization

5.1 Correlation Analysis

Correlation analysis is a statistical method for measuring the strength and direction of the linear relationship between two or more variables[8]. Its core indicator is the correlation coefficient (such as Pearson's correlation coefficient), which ranges from -1 to 1. A positive correlation (close to 1) indicates that the two variables change in the same direction, a negative correlation (close to -1) indicates that they change in opposite directions, and a correlation of zero indicates no linear relationship between them. The results are often presented as correlation matrices and heat maps to help researchers identify notable relationships between variables. In business data analysis, correlation analysis can reveal underlying patterns and provides a basis for variable selection and subsequent modeling. For example, analyzing the correlation between customer characteristics such as account balance and age can clarify the factors behind customer behavior and guide data-driven decision making and strategy optimization[9].
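As an illustration of how the coefficient is computed, the sketch below implements Pearson's correlation directly from its definition. The function name `pearson_r` and the balance and product values are hypothetical; they are chosen only to show a negative relationship and do not reproduce any value in the correlation matrix.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical balances (in thousands) and product counts for four customers.
balances = [0, 50, 100, 150]
products = [2, 2, 1, 1]

r = pearson_r(balances, products)
```

Here `r` is negative because the customers with higher balances hold fewer products in this toy sample.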

Figure 3 shows the correlation matrix of the variables, highlighting the relationships between key features in the dataset. In general, the correlations between most variables are weak, with coefficients close to zero, which indicates that these variables are largely uncorrelated with one another and reduces the risk of multicollinearity in subsequent modeling. The strongest correlation is between Balance (account balance) and NumOfProducts (number of products), with a coefficient of -0.30, a moderate negative correlation. This suggests that customers with higher account balances tend to hold fewer products, possibly because they prefer fewer but higher-value financial products. For other pairs, such as Tenure (years with the company) and Balance, or Age and CreditScore, the correlation coefficient is close to zero, indicating little linear relationship. Notably, EstimatedSalary shows little correlation with the other variables, which is consistent with the earlier statistical tests that indicated limited predictive power for customer churn. Similarly, CreditScore has little correlation with the other variables, which suggests that it carries information not captured by them.

Figure 3. Correlation Matrix

5.2 Boxplots for Numerical Variables

Figure 4 shows boxplots of each numerical independent variable grouped by the dependent variable Exited (churned or not), from which differences in distribution between churned and non-churned customers can be observed. For Age, churned customers are clearly older than non-churned customers, with a median age of about 45 compared with about 36, indicating that older customers are more likely to churn. For Balance, churned customers generally have higher account balances and a higher median than non-churned customers, which may reflect a higher churn risk among high-balance customers. For CreditScore, the median of churned customers is close to that of non-churned customers, but the distribution for churned customers sits slightly lower, suggesting that a lower credit score may be associated with churn. For NumOfProducts, non-churned customers hold slightly more products than churned customers, indicating that customers with more products may have higher retention rates. EstimatedSalary (estimated salary) and Tenure (years with the company), on the other hand, show no notable difference between the two groups, and their medians are essentially the same. This suggests that salary level and tenure have only a weak effect on customer churn.

Figure 4. Boxplots of Numerical Variables

5.3 Categorical Variable Analysis

Figure 5 shows how churn (Exited) is distributed across the categorical variables Gender, Geography, HasCrCard, and IsActiveMember. For Gender, the proportion of customers who churn is higher among female customers than among male customers, even though there are more male customers overall, suggesting that gender may be an important factor in customer churn. For Geography, churn rates differ clearly between countries. The proportion of churn in Germany is markedly higher than in France and Spain, suggesting that regional factors are closely related to churn and call for a differentiated approach to customers in different regions. For HasCrCard (whether the customer has a credit card), the churn proportions of customers with and without a credit card do not differ much. This suggests that credit card ownership has little impact on churn, consistent with the earlier chi-square test results. IsActiveMember (whether the customer is an active member) is another important factor. Inactive members (IsActiveMember = 0) have a much higher churn rate than active members (IsActiveMember = 1). Active members are therefore retained at higher rates, suggesting that companies should increase their focus on inactive customers and raise their engagement.

Figure 5. Categorical Variables and Churn

6. Model Building

6.1 Logistic Regression

Logistic regression is a statistical model widely used in binary classification problems to model the relationship between a binary dependent variable (such as whether a customer churns) and multiple independent variables[10]. By estimating a regression coefficient for each independent variable, the model quantifies the influence of that variable on the probability of the outcome. This study used binary logistic regression to analyze the drivers of churn, where the dependent variable was whether a customer churned (Exited) and the independent variables included demographic characteristics (such as Age and Gender), geographic information (Geography), and financial characteristics (such as Balance and CreditScore)[11]. The model was fitted by maximum likelihood estimation, and the significance (p-value) and regression coefficient (Estimate) of each variable were evaluated to identify the key influencing factors[12].

Table 3 shows the results of the logistic regression model, in which several variables significantly affect the probability of customer churn. For example, Age has a significant positive effect on churn probability (p < 0.001), indicating that older customers are more likely to churn. Similarly, Balance has a significant positive effect on churn (p < 0.001), indicating that customers with high balances have a higher risk of churn. For Geography, the churn rate of German customers is significantly higher than that of French customers (the reference category), while the difference between Spanish and French customers is not significant. In addition, IsActiveMember (whether the customer is an active member) is significantly negatively associated with the probability of churn (p < 0.001), indicating that active customers are more likely to be retained. On the other hand, some variables have a weak or non-significant effect on customer churn, such as HasCrCard (whether the customer has a credit card, p = 0.4515) and EstimatedSalary (estimated salary, p = 0.3102). These variables may not be important factors in churn. In future work, the variables with significant p-values will be used to construct predictive models.

Table 3. Logistic Regression Results

| Variable | Estimate | Std. Error | z-value | p-value |
|---|---|---|---|---|
| (Intercept) | -3.392 | 0.2448 | -13.857 | <0.001 |
| CreditScore | -6.683e-04 | 2.803e-04 | -2.384 | 0.0171 |
| GeographyGermany | 7.747e-01 | 6.767e-02 | 11.448 | <0.001 |
| GeographySpain | 3.522e-02 | 7.064e-02 | 0.499 | 0.6181 |
| GenderMale | -5.285e-01 | 5.449e-02 | -9.699 | <0.001 |
| Age | 7.271e-02 | 2.576e-03 | 28.230 | <0.001 |
| Tenure | -1.595e-02 | 9.355e-03 | -1.705 | 0.0882 |
| Balance | 2.637e-06 | 5.142e-07 | 5.128 | <0.001 |
| NumOfProducts | -1.015e-01 | 4.713e-02 | -2.154 | 0.0312 |
| HasCrCard1 | -4.468e-02 | 5.934e-02 | -0.753 | 0.4515 |
| IsActiveMember1 | -1.075e+00 | 5.769e-02 | -18.643 | <0.001 |
| EstimatedSalary | 4.807e-07 | 4.737e-07 | 1.015 | 0.3102 |
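The estimates in Table 3 are on the log-odds scale, which is hard to read directly. The sketch below exponentiates them into odds ratios and applies the logistic function to one customer profile. The coefficients are copied from Table 3; the helper `churn_probability` and the customer profile are hypothetical and serve only to show the arithmetic, not to reproduce any prediction from the study.

```python
import math

# Estimates (log-odds) copied from Table 3.
coefficients = {
    "(Intercept)": -3.392,
    "CreditScore": -6.683e-04,
    "GeographyGermany": 7.747e-01,
    "GeographySpain": 3.522e-02,
    "GenderMale": -5.285e-01,
    "Age": 7.271e-02,
    "Tenure": -1.595e-02,
    "Balance": 2.637e-06,
    "NumOfProducts": -1.015e-01,
    "HasCrCard1": -4.468e-02,
    "IsActiveMember1": -1.075e+00,
    "EstimatedSalary": 4.807e-07,
}

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit increase in the variable, holding the others fixed.
odds_ratios = {
    name: math.exp(beta)
    for name, beta in coefficients.items()
    if name != "(Intercept)"
}

def churn_probability(features):
    """Logistic function applied to the linear predictor for one customer."""
    log_odds = coefficients["(Intercept)"] + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical customer profile (illustrative values only).
inactive_customer = {
    "CreditScore": 650, "GeographyGermany": 1, "GenderMale": 0, "Age": 45,
    "Tenure": 5, "Balance": 100000, "NumOfProducts": 1, "HasCrCard1": 1,
    "IsActiveMember1": 0, "EstimatedSalary": 100000,
}
active_customer = dict(inactive_customer, IsActiveMember1=1)

p_inactive = churn_probability(inactive_customer)
p_active = churn_probability(active_customer)
```

Read this way, the German estimate corresponds to roughly 2.17 times the odds of churn relative to France, and active membership to roughly 0.34 times the odds, other variables held constant.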

Figure 6 shows the diagnostic plots of the logistic regression model, which are used to evaluate the quality of the fit and whether the model assumptions are reasonable. The "Residuals vs Fitted" plot at the top left shows the relationship between the residuals and the predicted values. Most of the data points are clustered around zero, but some outlying points can be observed, indicating that the model may have large prediction errors in some cases. Nonetheless, there is no clear pattern in the overall trend, suggesting a reasonable distribution of residuals. The "Q-Q Residuals" plot at the top right checks the normality of the standardized residuals. The data points generally follow the diagonal line, although there is some departure at extreme values, so the residuals are roughly consistent with a normal distribution. The "Scale-Location" plot at the bottom left checks whether the spread of the residuals is constant. The standardized residuals remain relatively stable as the predicted value changes, although they are somewhat dispersed in the low predicted value range. On the whole, heteroscedasticity is weak, which is acceptable for this analysis. The "Residuals vs Leverage" plot at the bottom right is used to identify points with high leverage and strong influence. Most of the data points are densely clustered, and the model appears reasonable overall. Note that these plots were designed for linear models; with a binary outcome, residuals are not expected to be exactly normal, so the plots should be read as rough checks rather than formal tests.

Figure 6. Logistic Regression Diagnostics

6.2 XGBoost Model

XGBoost (Extreme Gradient Boosting) is an efficient machine learning algorithm based on gradient-boosted decision trees that is widely used for classification and regression problems[13]. The algorithm is designed for predictive performance and computational efficiency: it builds weak prediction models (usually decision trees) in stages, with each stage reducing the prediction error left by the previous ones. XGBoost is known for its speed, accuracy, and ability to handle large amounts of data[14]. Its key parameters include the maximum tree depth (max_depth), the learning rate (eta), and the number of boosting iterations (nrounds). In this study, the parameters were fixed at max_depth = 6, eta = 0.05, and nrounds = 100 to predict customer churn.
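The staged error reduction described above can be illustrated with a deliberately simplified boosting loop. This is not XGBoost: it uses one-split trees (stumps, so depth 1 rather than 6), squared-error residuals, and no regularization, and it borrows only the `eta = 0.05` and `nrounds = 100` settings from the text. The function names and the toy ages and labels are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    # Exclude the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, eta=0.05, nrounds=100):
    """Fit stumps in stages, each one to the residuals of the current model."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(nrounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # The learning rate shrinks each stage's contribution.
        predictions = [p + eta * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + eta * sum(s(x) for s in stumps)

# Hypothetical toy data: age and churn label.
ages = [25, 30, 35, 40, 45, 50, 55, 60]
churned = [0, 0, 0, 0, 1, 1, 1, 1]

model = boost(ages, churned)
```

A small learning rate means each tree corrects only a fraction of the remaining error, which is why a low `eta` is usually paired with a larger number of rounds.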

The evaluation metrics show that XGBoost performs well on the test dataset. The prediction accuracy of the model is 86.43%. For context, always predicting "no churn" would already achieve roughly 80% accuracy given the class distribution described in Section 4.1, so accuracy alone understates how hard the task is. The AUC (area under the ROC curve) of the model is 0.8676, indicating that the model discriminates well between churned and non-churned customers (see Figure 7). The ROC curve shows the classification performance of the model across different thresholds, and it lies well above the diagonal that represents random classification, which further supports the model's discriminative ability[15].
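Both metrics can be computed directly from true labels and predicted scores. The sketch below is a minimal standard-library illustration; the AUC is computed as the probability that a randomly chosen churned customer receives a higher score than a randomly chosen retained one, which equals the area under the ROC curve. The function names, labels, and scores are hypothetical and are not the study's test data.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of customers whose thresholded score matches the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = churned) and predicted churn probabilities.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]

acc = accuracy(labels, scores)
area = auc(labels, scores)
baseline = accuracy(labels, [0.0] * len(labels))  # always predict "no churn"
```

In this toy sample the model's accuracy equals the majority-class baseline while its AUC is well above 0.5, which shows why the two metrics are worth reporting together on imbalanced data.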

Figure 7. ROC Curve

Feature Importance

Feature importance analysis is a common way to interpret machine learning models by quantifying the contribution of each feature to the model's predictions[16]. In the XGBoost model, feature importance can be assessed with the "Gain" metric, which measures how much a feature reduces the error when it is used to split tree nodes. Aggregating these contributions across all trees makes it possible to rank the most influential variables in the model. This analysis not only helps explain the model's decisions but also informs feature selection and business strategy.
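For reference, the gain of a single split can be written as follows, using the notation of the original XGBoost formulation. Here $G$ and $H$ are the sums of first- and second-order gradients of the loss over the samples in the left ($L$) and right ($R$) child nodes, and $\lambda$ and $\gamma$ are regularization parameters. The normalization in the second line is one common way of turning split gains into a per-feature share and is an assumption here, since the exact aggregation used for Figure 8 is not stated.

```latex
\[
\mathrm{Gain}_{\text{split}}
  = \frac{1}{2}\left[
      \frac{G_L^{2}}{H_L + \lambda}
    + \frac{G_R^{2}}{H_R + \lambda}
    - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}
    \right] - \gamma
\]
\[
\mathrm{Importance}(f)
  = \frac{\sum_{s \in S_f} \mathrm{Gain}_{s}}
         {\sum_{s \in S} \mathrm{Gain}_{s}}
\]
```

In the second expression, $S$ is the set of all splits in the model and $S_f$ is the subset of splits that use feature $f$, so the importances sum to one across features.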

Figure 8 shows the results of the feature importance analysis for the XGBoost model. Age is the feature with the greatest impact on predicting customer churn, followed by NumOfProducts (number of products) and Balance (account balance) in second and third place. Combined with the earlier analyses, this suggests that customers who are older, have higher account balances, and hold fewer products are more likely to churn. In addition, IsActiveMember (active membership) and CreditScore (credit score) also contribute to predicting churn, reflecting the importance of customer activity and credit standing. Geography (country) and Gender, although they have some effect on churn rates, are of relatively low importance in the model.

Figure 8. Feature Importance

7. Conclusion and Implications

7.1 Key Findings

Several conclusions can be drawn from the customer churn analysis. First, age, account balance, and number of products are important factors in predicting customer churn. Older customers, customers with higher account balances, and customers with fewer products are more likely to churn. In addition, whether the customer is an active member (IsActiveMember) and the customer's creditworthiness (CreditScore) also have a significant effect on the probability of churn. Inactive customers are more likely to churn, and customers with lower credit scores have a higher risk of churn. Geographic location also plays an important role in churn rates. In particular, the churn rate in Germany is significantly higher than in France and Spain. However, variables such as credit card ownership (HasCrCard) and estimated salary (EstimatedSalary) are less important in the models, indicating that their ability to predict customer churn is limited[17][18].

7.2 Business Strategies

Based on the results of the analysis, we propose several business strategies to reduce customer churn and increase customer loyalty. First, older customers with higher account balances can be offered personalized services and offers to improve their satisfaction, for example value-added services or loyalty programs designed for high-balance customers. Second, inactive customers should receive more frequent and more targeted communication, mainly through customized marketing campaigns or regular offers, to increase their engagement and reduce the risk of churn. Third, customers should be encouraged to choose a more diverse product mix, which is associated with a lower risk of churn, and product features should be adjusted promptly, based on how customers use existing products, to better meet their needs. Fourth, to address the high churn rate among German customers, regional solutions should be developed and more customer service and marketing resources invested in Germany to better serve local customers. Finally, clearer credit education and management advice can be provided to customers with low credit scores to strengthen their trust in the company's products and services and thereby reduce churn related to credit problems[19][20].

7.3 Limitations

Although this study provides useful insights, it has several limitations. First, the dataset comes from a public platform and may not fully reflect the situation of any specific enterprise, so the applicability of the results may be limited. Second, variable selection is constrained by the available data, which may omit other potentially important factors such as customer satisfaction with the service or the intensity of market competition. Third, although XGBoost, logistic regression, and the other methods achieve high prediction accuracy, the interpretability of the models is still limited in some cases. Finally, the study relies on static data and lacks time-series or dynamic behavioral data, which may limit insight into long-term trends in customer behavior[21].

References

[1] Reichheld, F. F., & Sasser, W. E. (1990). Zero defections: Quality comes to services. Harvard Business Review, 68(5), 105–111.

[2] Blattberg, R. C., Kim, B. D., & Neslin, S. A. (2001). Customer equity: Building and managing relationships as valuable assets. Harvard Business School Press.

[3] Student. (1908). The probable error of a mean. Biometrika, 6(1), 1–25.

[4] Fisher, R. A. (1925). Statistical methods for research workers. Oliver and Boyd.

[5] Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.

[6] Agresti, A. (2018). Statistical methods for the social sciences (5th ed.). Pearson Education.

[7] Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). McGraw-Hill.

[8] Pearson, K. (1895). Note on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London, 58, 240–242.

[9] Mukaka, M. M. (2012). A guide to appropriate use of correlation coefficient in medical research. Malawi Medical Journal, 24(3), 69–71.

[10] Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2), 215–242.

[11] Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (3rd ed.). Wiley.

[12] Menard, S. (2002). Applied logistic regression analysis (2nd ed.). Sage.

[13] Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). ACM.

[14] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., ... & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 3149–3157.

[15] Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.

[16] Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

[17] Jeni, L. A., Cohn, J. F., & De La Torre, F. (2013). Facing imbalanced data–recommendations for the use of performance metrics. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (pp. 245–251). IEEE.

[18] King, G., & Zeng, L. (2001). Logistic regression in rare events data. Political Analysis, 9(2), 137–163.

[19] Oliver, R. L. (1999). Whence consumer loyalty? Journal of Marketing, 63(Special Issue), 33–44.

[20] Kotler, P., & Keller, K. L. (2016). Marketing management (15th ed.). Pearson Education.

[21] Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.