17 - Predictive Models 101
We are leaving Part I of this book. That part covered the core of causal inference. The techniques there are well known and established, and they have survived the test of time. Part I builds the solid foundation we can rely upon.
In more technical terms, Part I focuses on defining what causal inference is, which biases prevent correlation from being causation, multiple ways to adjust for those biases (regression, matching and propensity score) and canonical identification strategies (instrumental variables, diff-in-diff and RDD).
In summary, Part I focuses on the standard techniques we use to identify the average treatment effect, $E[Y_1 - Y_0]$.
As we move to Part II, things will get a bit shaky. We will cover recent developments in the causal inference literature, their relationship with Machine Learning and their applications in the industry. In that sense, we trade off academic rigour for applicability and empiricism.
Some methods presented in Part II don’t have a solid theory about why they work. Still, when we try them, they seem to work.
In that sense, Part II might be more useful for industry practitioners who want to use causal inference in their day-to-day work than for scientists who want to research a fundamental causal relationship in the world.
The first few chapters of Part II will focus on estimating heterogeneous treatment effects. We will move from a world where all we cared about was the average treatment effect, $E[Y_1 - Y_0]$, to one where we want to know how different units respond differently to the treatment, $E[Y_1 - Y_0 | X]$.
In a sense, we are also moving from a positive question, what is the average treatment effect, to a normative one: who should we treat?
This is the question most businesses ask themselves, albeit in slightly different terms: who should I give discounts to? What interest rate should I charge on a loan? What item should I recommend to this user? What page layout should I show to each customer?
Those are all treatment effect heterogeneity questions that we can answer with the tools presented in Part II.
But before we do that, it’s only fair that I present what Machine Learning means to the industry, as this will become a fundamental tool we will later use for causal inference.
Machine Learning in the Industry
The focus of this chapter is to talk about how we normally use machine learning in the industry. If you are not familiar with machine learning, you can see this chapter as a machine learning crash course. And if you’ve never worked with ML before, I strongly recommend you learn at least the basics to get the most out of what’s to come.
But this doesn’t mean you should skip this chapter if you are already versed in ML. I still think you will benefit from reading it through. Unlike other machine learning material, this chapter will not discuss the ins and outs of algorithms like decision trees or neural networks. Instead, it will be laser focused on how machine learning is applied in the real world.
The first thing I want to address is why we are talking about machine learning in a causal inference book. The short answer is that I think one of the best ways to understand causality is to put it in contrast with the predictive models approach brought by machine learning.
The long answer is twofold. First, if you’ve got to this point in this book, there is a high chance you are already familiar with machine learning. Second, even if you aren’t, given the current popularity of these topics, you probably already have some idea of what they are.
The only problem is that, with all the hype around machine learning, I might have to bring you back to earth and explain what it really does in very practical terms. Finally, more recent developments in causal inference make heavy use of machine learning algorithms, so there is that too.
Being very direct, machine learning is a way to make fast, automatic and good predictions. That’s not the entire picture, but it covers 90% of it. It’s in the field of supervised machine learning where most of the cool advancements, like computer vision, self-driving cars, language translation and diagnostics, have been made.
Notice how, at first, these might not seem like prediction tasks. How is language translation a prediction? And that’s the beauty of machine learning. We can solve more problems with prediction than what is initially apparent.
In the case of language translation, you can frame it as a prediction problem where you present a machine with one sentence and it has to predict the same sentence in another language. Notice that I’m not using the word prediction in the sense of forecasting or anticipating the future. Prediction is simply mapping from one defined input to an initially unknown but equally well defined and observable output.
What machine learning really does is it learns this mapping function, even if it is a very complicated mapping function. The bottom line is that if you can frame a problem as this mapping from an input to an output, then machine learning might be a good candidate to solve it.
As for self-driving cars, you can think of it as not one, but multiple complex prediction problems: predicting the correct angle of the wheel from sensors in the front of the car, predicting the pressure in the brakes from cameras around the car, predicting the pressure in the accelerator from GPS data.
Solving those prediction problems (and a ton more) is what makes a self-driving car.
A more technical way of thinking about ML is in terms of estimating (possibly very complex) expectation functions:
$$E[Y|X]$$
where $Y$ is what you want to predict (the translated sentence, for example) and $X$ is what you already know (the input sentence).
OK… You now understand how prediction can be more powerful than we first thought. Self-driving cars and language translation are cool and all, but they are quite distant, unless you work at a major tech company like Google or Uber.
So, to make things more relatable, let’s talk in terms of a problem almost every company has: customer acquisition (that is, getting new customers).
From the customer acquisition perspective, what you often have to do is figure out who the profitable customers are.
In this problem, each customer has a cost of acquisition (maybe marketing costs, onboarding costs, shipping costs…) and will hopefully generate a positive cash flow for the company. For example, let’s say you are an internet provider or a gas company.
Your typical customer might have a cash flow that looks something like this.
Each bar represents a monetary event in the life of your relationship with the customer. For example, to get a customer, right off the bat, you need to invest in marketing.
Then, after someone decides to do business with you, you might incur some sort of onboarding cost (where you have to explain to your customer how to use your product) or installation costs. Only then does the customer start to generate monthly revenues.
At some point, the customer might need some assistance and you will have maintenance costs. Finally, if the customer decides to end the contract, you might have some final costs for that too.
To see if this is a profitable customer, we can rearrange the bars into what is called a cascade plot. Hopefully, the sum of the cash events ends up way above the zero line.
In contrast, it could very well be that the customer will generate far more costs than revenues. If he or she uses very little of your product and has high maintenance demands, when we pile up the cash events, they could end up below the zero line.
Of course, this cash flow could be simpler or much more complicated, depending on the type of business. You can do stuff like time discounts with an interest rate and get all crazy about it, but I think the point here is made.
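To make the cash flow picture concrete, here is a minimal sketch with made-up numbers. The cash events and the 1% rate per period are assumptions for illustration, not data from this chapter. The running sum is what the cascade plot piles up, and dividing each event by a discount factor gives the time-discounted version mentioned above.

```python
import numpy as np

# Hypothetical cash events for one customer: marketing, onboarding,
# six periods of revenue, one maintenance call and a termination cost.
cash_events = np.array([-50.0, -20.0, 15.0, 15.0, 15.0, 15.0, -10.0, 15.0, 15.0, -5.0])

# Running total: the height of the cascade after each event.
cumulative = np.cumsum(cash_events)
net_value = cumulative[-1]

# Optional time discounting at an assumed 1% rate per period.
rate = 0.01
periods = np.arange(len(cash_events))
discounted_net_value = np.sum(cash_events / (1 + rate) ** periods)

print(net_value, round(discounted_net_value, 2))
```

Because the costs come early and the revenues come later, discounting shrinks the net value of this particular customer.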
But what can you do about this? Well, if you have many examples of profitable and non profitable customers, you can train a machine learning model to identify them. That way, you can focus your marketing strategies on engaging only with the profitable customers.
Or, if your contract permits, you can end relations with a customer before he or she generates more costs. Essentially, what you are doing here is framing the business problem as a prediction problem so that you can solve it with machine learning: you want to predict or identify profitable and unprofitable customers so that you only engage with the profitable ones.
import pandas as pd
import numpy as np
from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib import style
style.use("ggplot")
For instance, suppose you have 30 days of transactional data on 10000 customers. You also have the cost of acquisition cacq. This could be the bid you place for them if you are doing online marketing, it could be the cost of shipping or any training you have to do with your customer so they can use your product.
Also, for the sake of simplicity (this is a crash course, not a semester on customer valuation), let’s pretend you have total control over which customers you do business with. In other words, you have the power to deny a customer even if he or she wants to do business with you.
If that’s the case, your task now becomes identifying who will be profitable beforehand, so you can choose to engage only with them.
transactions = pd.read_csv("data/customer_transactions.csv")
print(transactions.shape)
transactions.head()
(10000, 32)
| | customer_id | cacq | day_0 | day_1 | day_2 | day_3 | day_4 | day_5 | day_6 | day_7 | ... | day_20 | day_21 | day_22 | day_23 | day_24 | day_25 | day_26 | day_27 | day_28 | day_29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -110 | 6 | 0 | 73 | 10 | 0 | 0 | 0 | 21 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | -58 | 0 | 0 | 0 | 15 | 0 | 3 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | -7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | -30 | 0 | 3 | 2 | 0 | 9 | 0 | 0 | 0 | ... | 0 | 0 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | -42 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 32 columns
What we need to do now is distinguish the good from the bad customers according to this transactional data. For the sake of simplicity, I’ll just sum up all transactions and the CACQ.
Keep in mind that this sweeps a lot of nuances under the rug, like distinguishing customers that have churned from those that are in a break between one purchase and the next.
I’ll then join this sum, which I call net_value, with customer specific features. Since my goal is to figure out which customers will be profitable before deciding to engage with them, we can only use data from before the acquisition period. In our case, these features are age, region and income, which are all available in another CSV file.
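# Net value per customer: sum of the acquisition cost (cacq) and all daily transactions.
profitable = (transactions[["customer_id"]]
              .assign(net_value=transactions
                      .drop(columns="customer_id")
                      .sum(axis=1)))
# Attach the pre-acquisition features (region, income, age) to each customer.
customer_features = (pd.read_csv("data/customer_features.csv")
                     .merge(profitable, on="customer_id"))
customer_features.head()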
| | customer_id | region | income | age | net_value |
|---|---|---|---|---|---|
| 0 | 0 | 30 | 1025 | 24 | 130 |
| 1 | 1 | 41 | 1649 | 26 | 10 |
| 2 | 2 | 18 | 2034 | 33 | -6 |
| 3 | 3 | 20 | 1859 | 35 | 136 |
| 4 | 4 | 1 | 1243 | 26 | -8 |
Good! Our task is becoming less abstract. We wish to distinguish the profitable customers (net_value > 0) from the non profitable ones. Let’s try different things and see which one works better. But before that, we need to take a quick look into Machine Learning (feel free to skip it if you know how ML works).
Machine Learning Crash Course
For our intents and purposes, we can think of ML as an overpowered way of making predictions. For it to work, you need some data with labels, that is, the ground truth of what you are predicting.
Then, you can train an ML model on that data and use it to make predictions where the ground truth is not yet known. The image below exemplifies the typical machine learning flow.
First, you need data where the ground truth, net_value here, is known. Then, you train an ML model that will use features (region, income and age in our case) to predict net_value. This training or estimating step will produce a machine learning model that can be used to make predictions about net_value when you don’t yet have the true net_value. This is shown in the left part of the image. You have some new data where you have the features (region, income and age) but you don’t know the net_value yet. So you pass this data through your model and it provides you with net_value predictions.
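If you are more into technical notation, another way of understanding machine learning is in terms of estimating a conditional expectation $E[Y|X]$, where $Y$ is called the target variable and $X$ is called the features. In our case, it is $E[net\_value|region, income, age]$.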
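As a small illustration of what estimating a conditional expectation means, the sketch below uses synthetic data (an assumption, not the chapter’s dataset) with a single discrete feature. Averaging the outcome within each feature value is the most direct estimate of the conditional expectation, and a fully grown regression tree on that one feature ends up reproducing the same averages.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): one discrete feature x
# and a noisy outcome y whose mean depends on x.
toy = pd.DataFrame({"x": rng.integers(0, 3, size=300)})
toy["y"] = 10 * toy["x"] + rng.normal(0, 1, size=300)

# Direct estimate of E[y | x]: the average of y within each value of x.
group_means = toy.groupby("x")["y"].mean()

# A fully grown tree on this single discrete feature has one leaf per
# value of x, so each leaf predicts the average of y for that value.
tree = DecisionTreeRegressor(random_state=0).fit(toy[["x"]], toy["y"])
tree_preds = tree.predict(pd.DataFrame({"x": group_means.index}))

print(group_means.round(2).to_list())
print(tree_preds.round(2).tolist())
```

With many features, or continuous ones, simple group averages stop being practical, which is where more flexible models come in.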
One tricky thing with ML models is that they can approximate almost any function. Another way of saying this is that they can be made so powerful as to perfectly fit the data in the training set. Machine learning models often have what we call complexity hyperparameters.
These things adjust how powerful or complex the model can be. In the image below, you can see examples of a simple model (left), an intermediate model (middle) and a complex and powerful model (right). Notice how the complex model has a perfect fit of the training data.
This raises some problems. Namely, how can we know if our model is any good before using it to make predictions in the real world? One way we have is to compare the predictions with the actual values on the dataset where we have the ground truth.
These are so-called goodness of fit metrics, like $R^2$. However, remember that the model can be made powerful enough to fit the data perfectly, in which case the predictions will match the ground truth exactly. This is problematic, because it means this kind of validation is misleading: we can nail it just by making our model more powerful and complex.
Besides, it is generally not a good thing to have a very complex model. And you already have some intuition into why that is the case. In the image above, for instance, which model do you prefer? The more complex one that gets all the predictions right? Probably not. You probably prefer the middle one.
It’s smoother, simpler and yet, it still makes some good predictions, even if it doesn’t perfectly fit the data.
Your intuition is in the right place. If you give too much power to your model, it will learn not only the patterns in your data but also the random noise.
Since the noise will be different when you use the model to make predictions in the real world (it’s random after all), your “perfect” model will make mistakes. In ML terms, we say that models that are too complex are overfitting and don’t generalize well. So, what can we do?
We are going to pretend we don’t have access to parts of the data. The idea is to split the dataset for which we have the ground truth into two. Then, we can give one part for the model to train on and the other part we can use to validate the model predictions.
This is called cross validation.
In the dataset above, which the model didn’t see during training, the complex model doesn’t do a very good job. The model in the middle, on the other hand, seems to perform better.
To choose the right model complexity, we can train different models, each one with a different complexity, and see how they perform on some data for which we have the ground truth, but that was not used for training the model.
Cross validation is so important we should probably spend more time on it.
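The procedure just described can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (a sine curve plus noise), and the depths tried (1, 3 and 10) are arbitrary choices, not values used elsewhere in this chapter. For each complexity level, the model is scored on the data it was trained on and on data it never saw.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic data (assumed for illustration): a smooth signal plus noise.
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train models of increasing complexity and score each on both partitions.
scores = {}
for depth in [1, 3, 10]:
    model = GradientBoostingRegressor(n_estimators=200, max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    scores[depth] = (r2_score(y_train, model.predict(X_train)),
                     r2_score(y_test, model.predict(X_test)))

for depth, (r2_train, r2_test) in scores.items():
    print(f"max_depth={depth:2d}  train R2={r2_train:.2f}  test R2={r2_test:.2f}")
```

One would typically expect the deepest model to score best on the training partition while doing worse on the held-out partition than it did in training, which is the signature of overfitting.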
Cross Validation
Cross validation is essential for selecting the model complexity but it’s useful beyond that. We can use it whenever we want to try many different things and estimate how they would play out in the real world.
The idea behind cross validation is to mimic the real world, where we estimate a model on the data that we have, but we make predictions on new, unseen data. The holdout data that we pretend not to have serves as a proxy for what we will encounter in the wild.
Let’s see how we can apply cross validation to the whole problem of figuring out which customers are profitable or not. Here is an outline of what we should do:
1. We have data on existing customers. On this data, we know which ones are profitable and which ones are not (we know the ground truth). Let’s call our internal data the training set.
2. We will use the internal data to learn a rule that tells us which customer is profitable (hence training).
3. We will apply the rule to the holdout data that was not used for learning the rule. This should simulate the process of learning a rule in one dataset and applying it to another, a process that will be inevitable when we go to production and score truly unseen data.
Here is a picture of what cross validation looks like. There is the truly unseen data at the rightmost part of the image and then there is data that we only pretend not to have at learning time.
To summarize, we will partition our internal data into a training and a test set. We can use the training set to come up with models or rules that predict if a customer is profitable or not, but we will validate those rules in another partition of the dataset: the test set.
This test set will be hidden from our learning procedure.
Just as a side note here, there are tons of ways to make cross validation better than this simple train-test split (k-fold cross validation or temporal cross validation, for instance), but for the sake of what we will do here, this is enough.
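For readers curious about the k-fold variant mentioned above, here is a minimal sketch using scikit-learn’s KFold and cross_val_score. The data is synthetic and the model settings are arbitrary assumptions. The chapter itself sticks to the single train-test split.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)

# Synthetic data (assumed for illustration only).
X = rng.normal(size=(300, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 1, size=300)

# 5 folds: each observation is used for validation exactly once.
folds = KFold(n_splits=5, shuffle=True, random_state=2)
model = GradientBoostingRegressor(n_estimators=100, max_depth=2, random_state=2)

# One out-of-fold R2 per fold; their spread hints at how noisy the estimate is.
fold_scores = cross_val_score(model, X, y, cv=folds, scoring="r2")
print(fold_scores.round(2), fold_scores.mean().round(2))
```

Averaging over folds uses every observation for validation once, at the price of training the model several times.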
Remember that the spirit of cross validation is to simulate what would happen once we go to a production environment. By doing that we hope to get more realistic estimates.
For our case, I won’t do anything fancy. I’ll just divide the dataset into two. 70% will be used to build a method that allows us to identify profitable customers and 30% will be used to evaluate how good that method is.
train, test = train_test_split(customer_features, test_size=0.3, random_state=13)
train.shape, test.shape
((7000, 5), (3000, 5))
Predictions and Policies
We’ve been talking about methods and approaches to identify profitable customers, but it is time we get more precise with our concepts. Let’s introduce two new ones. A prediction is a number that estimates or predicts something. It’s the estimation of the conditional expectation $E[Y|X]$.
The second concept is that of a policy. A policy is an automatic decision rule. While a prediction is a number, a policy is a decision. For example, we can have a policy that engages with customers with income greater than 1000 and doesn’t engage otherwise.
We usually build policies on top of predictions: engage with all customers that have profitability predictions above 10 and don’t engage otherwise. Machine learning will usually take care of the first concept, that is, of predictions. But notice that predictions alone are useless. We need to attach some decision, or policy, to them.
We can do very simple policies and models or very complicated ones. For both policies and predictions, we need to use cross validation, that is, estimate the policy or prediction in one partition of the data and validate its usefulness in another.
Since we’ve already partitioned our data into two, we are good to go.
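The difference between a prediction and a policy can be sketched in a few lines. The function name threshold_policy and the predicted values below are hypothetical, introduced only for illustration. The threshold of 10 is the one from the example above.

```python
import numpy as np

def threshold_policy(predictions, threshold=10):
    """Return True (engage) where the predicted net value exceeds the threshold."""
    return np.asarray(predictions) > threshold

# Hypothetical predicted net values for five customers.
predicted_net_value = np.array([25.0, -3.0, 10.0, 11.5, 0.0])

# The predictions are numbers; the policy turns them into decisions.
engage = threshold_policy(predicted_net_value)
print(engage)
```

The predictions could come from any model. The policy only cares about the rule that maps each number to a decision.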
One Feature Policies
Before we go machine learning crazy on this profitability problem, let’s try the simple stuff first. The 80% gain with 20% effort stuff. These approaches often work wonders and, surprisingly, most data scientists forget about them. So, what is the simplest thing we can do? Naturally, just engage with all the customers! Instead of figuring out which ones are profitable, let’s just do business with everyone and hope the profitable customers more than compensate for the non profitable ones.
To check if this is a good idea, we can see the average net value of the customers. If that turns out to be positive, it means that, on average, we will make money on our customers.
Sure, there will be profitable and non profitable ones but, on average, if we have enough customers, we will make money. On the other hand, if this value is negative, it means that we will lose money if we engage with all the customers.
train["net_value"].mean()
-29.169428571428572
That’s a bummer… If we engage with everyone, we would lose about 30 reais per customer we do business with. Our first, very simple thing didn’t work and we had better find something more promising if we don’t want to go out of business.
Just a quick side note here, keep in mind that this is a pedagogical example. Although the very simple, “treat everyone the same” kind of policy didn’t work here, such policies often do work in real life.
It is usually the case that sending a marketing email to everyone is better than not sending it, and giving discount coupons to everyone is often better than not giving them.
Moving forward, what is the next simplest thing we can think of? One idea is taking our features and seeing if they alone distinguish the good from the bad customers. Take income, for instance. It’s intuitive that richer customers should be more profitable, right? What if we do business only with the richest customers? Would that be a good idea?
To figure this out we can partition our data into income quantiles (quantiles have the property of dividing the data into partitions of equal size, that’s why I like them). Then, for each income quantile, let’s compute the average net value. The hope here is that, although the average net value is negative, $E[NetValue] < 0$, there might be some income level where it is positive, $E[NetValue|Income = x] > 0$, probably among the higher incomes.
plt.figure(figsize=(12,6))
np.random.seed(123)  # seed because seaborn's CIs use bootstrap
# pd.qcut splits a column into quantile bins
sns.barplot(data=train.assign(income_quantile=pd.qcut(train["income"], q=20)),
            x="income_quantile", y="net_value")
plt.title("Profitability by Income")
plt.xticks(rotation=70);
And, sadly, nope. Yet again, all levels of income have negative average net value. Although it is true that richer customers are “less bad” than non rich customers, they still generate, on average, negative net value. So income didn’t help us much here, but what about the other variables, like region? If most of our costs come, say, from having to serve customers in far away places, we should expect that the region distinguishes the profitable from the unprofitable customers.
Since region is already a categorical variable, we don’t need to use quantiles here. Let’s just see the average net value per region.
plt.figure(figsize=(12,6))
np.random.seed(123)
region_plot = sns.barplot(data=train, x="region", y="net_value")
plt.title("Profitability by Region");
Bingo! We can clearly see that some regions are profitable, like regions 2, 17, 39, and some are not profitable, like regions 0, 9, 29 and the especially bad region 26. This is looking super promising! We can take this and transform it into a policy: only do business with the regions that proved to be profitable according to the data that we have here.
One thing to notice is that what we are doing is what ML would do, but in a much simpler way. Namely, we are estimating the expected value of net value in each region: $E[NetValue|Region]$.
To construct this policy, we will do something very simple. We will construct a 95% confidence interval around the expected net value of a region. If its lower bound is greater than zero, we will do business with that region.
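In symbols, and assuming a normal approximation for the regional average, the lower bound of that interval for region $r$ can be written as below. The notation is introduced here only for illustration: $\bar{y}_r$ is the average net value in the region on the training set, $s_r$ its sample standard deviation and $n_r$ the number of customers in the region.

$$
\text{lower}_r = \bar{y}_r - 1.96 \, \frac{s_r}{\sqrt{n_r}}
$$

The policy then engages with region $r$ whenever $\text{lower}_r > 0$.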
The following code builds a dictionary where the key is the region and the value is the lower bound of the 95% CI. Then, a dictionary comprehension keeps only those regions where that lower bound is positive. The result is the set of regions we will do business with according to our policy.
# per-region mean, count and standard deviation of net value on the training set
regions_to_net = train.groupby('region')['net_value'].agg(['mean', 'count', 'std'])
# lower bound of the 95% CI for the mean (normal approximation)
regions_to_net = regions_to_net.assign(
    lower_bound=regions_to_net['mean'] - 1.96*regions_to_net['std']/(regions_to_net['count']**0.5)
)
regions_to_net_lower_bound = regions_to_net['lower_bound'].to_dict()
regions_to_net = regions_to_net['mean'].to_dict()
# keep only the regions where the net value lower bound is > 0
regions_to_invest = {region: net
                     for region, net in regions_to_net_lower_bound.items()
                     if net > 0}
regions_to_invest
{1: 2.9729729729729737,
2: 20.543302704837856,
4: 10.051075065003388,
9: 32.08862469914759,
11: 37.434210420891255,
12: 37.44213667009523,
15: 32.09847683044394,
17: 39.52753893574483,
18: 41.86162250217046,
19: 15.62406327716401,
20: 22.06654814414531,
21: 24.621030401718578,
25: 33.97022928360584,
35: 11.68776141117673,
37: 27.83183541449011,
38: 49.740709395699994,
45: 2.286387928016998,
49: 17.01853709535029}
regions_to_invest has all the regions we will engage with. Let’s now see how this policy would have performed in our test set, the one we pretend not to have. This is a key step in evaluating our policy, because it could very well be that, simply by chance, a region in our training set appears to be profitable. If that is only due to randomness, it is unlikely that we will find that same pattern in the test set.
To do so, we will filter the test set to contain only the customers in the regions defined as profitable (according to our training set). Then, we will plot the distribution of net income for those customers and also show the average net income of our policy.
region_policy = (test[test["region"]
                      # filter regions in regions_to_invest
                      .isin(regions_to_invest.keys())])
sns.histplot(data=region_policy, x="net_value")
# average has to be over all customers, not just the one we've filtered with the policy
plt.title("Average Net Income: %.2f" % (region_policy["net_value"].sum() / test.shape[0]));
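The averaging used in the plot title can be wrapped in a small helper so that any policy can be scored the same way. The function name average_policy_value and the toy data are hypothetical, used only for illustration. The key point, taken from the comment in the code above, is that customers we do not engage with count as zero, so we divide by all customers and not only by the engaged ones.

```python
import pandas as pd

def average_policy_value(df, engage_mask, value_col="net_value"):
    """Average net value per customer when we engage only where engage_mask is True.

    Customers we do not engage with contribute zero, so the sum over engaged
    customers is divided by the total number of customers in df.
    """
    return df.loc[engage_mask, value_col].sum() / len(df)

# Tiny made-up test set (hypothetical values, not the chapter's data).
toy_test = pd.DataFrame({"region": [1, 1, 2, 3],
                         "net_value": [20.0, -5.0, -30.0, 40.0]})
profitable_regions = {1, 3}  # assumed to have been chosen on a training set

mask = toy_test["region"].isin(profitable_regions)
print(average_policy_value(toy_test, mask))  # (20 - 5 + 40) / 4
```

Dividing by the number of engaged customers instead would make a policy that engages with almost nobody look better than it is.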
Machine Learning Models as Policy Inputs
If you are willing to do even better, we can now use the power of machine learning. Keep in mind that this might add tons of complexity to the whole thing and usually brings only marginal gains. But, depending on the circumstances, marginal gains can be translated into huge piles of money and that’s why machine learning is so valuable these days.
Here, I’ll use a Gradient Boosting model. It’s a fairly complicated model to explain, but one that is very simple to use. For our purpose, we don’t need to get into the details of how it works. Instead, just remember what we’ve seen in our ML Crash course: an ML model is a super powerful predictive machine that has some complexity parameters. It’s a tool to estimate $E[Y|X]$.
Now, we need to ask, how can good predictions be used to improve upon our simple region policy to identify and engage with profitable customers? I think there are two main improvements that we can make here. First, you will have to agree that going through all the features looking for one that distinguishes good from bad customers is a cumbersome process. Here, we had only 3 of them (age, income and region), so it wasn’t that bad, but imagine if we had more than 100. Also, you have to be careful with issues of multiple testing and false positive rates. Second, it is probably the case that you need more than one feature to distinguish between customers. In our example, we believe that features other than region also have some information on customer profitability. Sure, when we looked at income alone it didn’t give us much, but what about income in those regions that are just barely unprofitable? Maybe, in those regions, if we focus only on richer customers, we could still get some profit. Technically speaking, we are saying that $E[NetValue|Region, Income, Age]$ is a better predictor of $NetValue$ than $E[NetValue|Region]$.
Coming up with these more complicated policies that involve interacting more than one feature can be super complex. The combinations we have to look at grow exponentially with the number of features and it is simply not a practical thing to do. Instead, what we can do is throw all those features into a machine learning model and have it learn those interactions for us. This is precisely what we will do next.
The goal of this model will be to predict `net_value` using `region`, `income`, `age`. To help it, we will take the region feature, which is categorical, and encode it with a numerical value. We will replace each region by the region’s average net value on the training set. Remember that we have those stored in the `regions_to_net` dictionary? With this, all we have to do is call the method `.replace()` and pass this dictionary as the argument. I’ll create a function for this, because we will do this replacement multiple times. This process of transforming features to facilitate learning is generally called feature engineering.
def encode(df):
    # replace each region by its average net value on the training set
    return df.replace({"region": regions_to_net})
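To make the encoding step more concrete, here is a minimal sketch on made-up data. The toy frames, `toy_regions_to_net` and `toy_encode` are hypothetical names used only for illustration, and I'm assuming the mapping is simply the per-region mean of `net_value` on the training set. It also uses `.map()` instead of `.replace()`, so that a region never seen in training becomes `NaN` rather than silently keeping its label.

```python
import pandas as pd

# Hypothetical toy training data; the real training set has many more rows.
toy_train = pd.DataFrame({
    "region": ["A", "A", "B", "B", "C"],
    "net_value": [10.0, 20.0, -5.0, -15.0, 30.0],
})

# Average net value per region, computed on the training set only.
toy_regions_to_net = toy_train.groupby("region")["net_value"].mean().to_dict()


def toy_encode(df, mapping):
    # Swap each region label for its training-set average net value.
    # Regions missing from the mapping become NaN.
    return df.assign(region=df["region"].map(mapping))


# Region "D" never appears in the toy training data.
toy_test = pd.DataFrame({"region": ["B", "C", "A", "D"]})
encoded = toy_encode(toy_test, toy_regions_to_net)
print(encoded)
```

The important part is that the averages come from the training set and are only applied to the test set, so no information about the test outcomes leaks into the feature.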
Next, our model will be imported from Sklearn. All their models have a pretty standard usage. First, you instantiate the model passing in the complexity parameters. For this model, we will set the number of estimators to 400, the max depth to 4 and so on. The deeper the model and the greater the number of estimators, the more powerful the model will be. Of course, we can’t let it be too powerful, otherwise it will learn the noise in the training data or overfit to it. Again, you don’t need to know the details of what these parameters do. Just keep in mind that this is a very good prediction model. Then, to train our model, we will call the `.fit()` method, passing the features `X` and the variable we want to predict - or target variable - `net_value`.
from sklearn import ensemble

model_params = {'n_estimators': 400,
                'max_depth': 4,
                'min_samples_split': 10,
                'learning_rate': 0.01,
                # squared-error loss; spelled 'ls' in scikit-learn versions before 1.0
                'loss': 'squared_error'}
features = ["region", "income", "age"]
target = "net_value"
np.random.seed(123)
reg = ensemble.GradientBoostingRegressor(**model_params)
# fit model on the training set
encoded_train = train[features].pipe(encode)
reg.fit(encoded_train, train[target]);
The model is now trained. Next, we need to check if it is any good. To do this, we can look at the predictive performance of this model on the test set. There are tons of metrics to evaluate the predictive performance of a machine learning model. Here, I’ll use one which is called $R^2$. It is used to evaluate models that predict a continuous variable (like `net_value`). Also, $R^2$ tells us how much of the variance in `net_value` is explained by our model.
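If it helps, the $R^2$ can be computed by hand as one minus the ratio between the sum of squared prediction errors and the sum of squared deviations from the mean. The sketch below uses made-up numbers (`y_true`, `y_pred` and the helper `r2_manual` are hypothetical, not from our data) and should agree with what Sklearn's `r2_score` returns for the same inputs.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    # Sum of squared errors of the model.
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Sum of squared errors of a "model" that always predicts the mean.
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot


# Made-up outcomes and predictions, for illustration only.
y_true = np.array([3.0, -1.0, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

r2_model = r2_manual(y_true, y_pred)

# Always predicting the average gives an R2 of exactly zero.
r2_mean = r2_manual(y_true, np.full_like(y_true, y_true.mean()))

# Predictions that are worse than the average give a negative R2.
r2_bad = r2_manual(y_true, -y_true)

print(r2_model, r2_mean, r2_bad)
```

This is why an $R^2$ close to 1 means the model explains most of the variance, while a value at or below zero means it does no better than predicting the average for everyone.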
from sklearn.metrics import r2_score

train_pred = (encoded_train
              .assign(predictions=reg.predict(encoded_train[features])))
print("Train R2: ", r2_score(y_true=train[target], y_pred=train_pred["predictions"]))
print("Test R2: ", r2_score(y_true=test[target], y_pred=reg.predict(test[features].pipe(encode))))
Train R2: 0.7108790300152951
Test R2: 0.6938513063048141
In this case, the model explains about 71% of the `net_value` variance in the training set but only about 69% of the `net_value` variance in the test set. This is expected. Since the model had access to the training set, the performance there will often be overestimated. Just for fun (and to learn more about overfitting), try setting the `max_depth` of the model to 14 and see what happens. You will likely see that the train $R^2$ shoots up while the test $R^2$ gets lower. This is what overfitting looks like.
Next, in order to make our policy, we will store the test set predictions in a `prediction` column. These predictions are estimates of $E[NetValue|region, income, age]$.
model_policy = test.assign(prediction=reg.predict(test[features].pipe(encode)))
model_policy.head()
| | customer_id | region | income | age | net_value | prediction |
|---|---|---|---|---|---|---|
| 5952 | 5952 | 19 | 1983 | 23 | 21 | 47.734883 |
| 1783 | 1783 | 31 | 914 | 31 | -46 | -36.026935 |
| 4811 | 4811 | 33 | 1349 | 25 | -19 | 22.553420 |
| 145 | 145 | 20 | 1840 | 26 | 55 | 48.306256 |
| 7146 | 7146 | 19 | 3032 | 34 | -17 | 7.039414 |
Just like we did with the `region` feature, we can show the average net value by predictions of our model. Since the prediction is continuous and not categorical, we need to make it discrete first. One way of doing so is using pandas `pd.qcut` (by golly! I love this function!), which partitions the data into quantiles using the model prediction. Let’s use 50 quantiles because 50 is the number of regions that we had. And just as a convention, I tend to call these model quantiles model bands, because it gives the intuition that this group has model predictions within a band, say, from -10 to 200.
plt.figure(figsize=(12,6))
n_bands = 50
bands = [f"band_{b}" for b in range(1,n_bands+1)]
np.random.seed(123)
model_plot = sns.barplot(data=model_policy
                         .assign(model_band = pd.qcut(model_policy["prediction"], q=n_bands)),
                         x="model_band", y="net_value")
plt.title("Profitability by Model Prediction Quantiles")
plt.xticks(rotation=70);
Here, notice how there are model bands where the net value is super negative, while there are also bands where it is very positive. Also, there are bands where we don’t know exactly if the net value is negative or positive. Finally, notice how they have an upward trend, from left to right. Since we are predicting net value, it is expected that the actual net value will tend to increase with the prediction.
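The bar heights in a plot like this are just the average outcome within each quantile group. The sketch below shows that computation on simulated data. Everything in it is made up for illustration: the `toy` frame, the 10 bands, and the assumption that the outcome is the prediction times 10 plus noise.

```python
import numpy as np
import pandas as pd

# Simulated predictions and outcomes; not the customer data from this chapter.
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({"prediction": rng.normal(size=n)})
toy["net_value"] = 10 * toy["prediction"] + rng.normal(size=n)

# Split the predictions into 10 equal-sized groups, numbered 1 to 10.
n_toy_bands = 10
toy["model_band"] = pd.qcut(toy["prediction"], q=n_toy_bands, labels=False) + 1

# Average outcome and number of rows in each band.
band_summary = toy.groupby("model_band")["net_value"].agg(["mean", "size"])
print(band_summary)
```

With a model that has some predictive power, the band averages should rise from the first band to the last, which is the upward trend we look for in the plot.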
Now, to compare this policy using a machine learning model with the one using only the regions we can also show the histogram of net gains, along with the total net value in the test set.
plt.figure(figsize=(10,6))
model_plot_df = (model_policy[model_policy["prediction"]>0])
sns.histplot(data=model_plot_df, x="net_value", color="C2", label="model_policy")
region_plot_df = (model_policy[model_policy["region"].isin(regions_to_invest.keys())])
sns.histplot(data=region_plot_df, x="net_value", label="region_policy")
plt.title("Model Net Income: %.2f; Region Policy Net Income %.2f." %
(model_plot_df["net_value"].sum() / test.shape[0],
region_plot_df["net_value"].sum() / test.shape[0]))
plt.legend();
As we can see, the model generates a better policy than just using the `region` feature, but not by much. While the model policy would have made us about 16.6 reais / customer on the test set, the region policy would have made us only 15.5 reais / customer. It’s just slightly better, but if you have tons and tons of customers, this might already justify using a model instead of a simple one-feature policy.
Fine Grain Policy#
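As a recap, so far, we tested the most simple of all policies, which is just engaging with all the customers. This policy can be seen as estimating the marginal net value, $E[NetValue]$. We then moved to a policy based on a single feature, region, and finally to one based on the predictions of a machine learning model, which estimate $E[NetValue|region, income, age]$.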
Here, the decision which the policy handles is very simple: engage with a customer or don’t engage. The policies we had so far dealt with the binary case. They were in the form of
if prediction > 0 then do business else don't do business.
This is something we call thresholding. If the prediction exceeds a certain threshold (zero in our case, but could be something else), we take one decision; if it doesn’t, we take another. One other example of where this could be applied in real life is transactional fraud detection: if the prediction score of a model that detects fraud is above some threshold `X`, we deny the transaction; otherwise, we approve it.
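As a minimal sketch of a thresholding policy, the snippet below engages only with customers whose prediction is above a threshold and then computes the average net value per customer. The four rows of data are made up for illustration. Note that the average is taken over all customers, not just the ones the policy selects.

```python
import pandas as pd

# Made-up predictions and realized net values for four customers.
toy = pd.DataFrame({
    "prediction": [12.0, -3.0, 0.5, -20.0],
    "net_value": [10, -5, -2, -30],
})

threshold = 0.0

# Binary decision: do business only if the prediction exceeds the threshold.
engage = toy["prediction"] > threshold

# Customers we don't engage with contribute zero net value,
# so we sum over the selected ones and divide by everyone.
avg_net_value = toy.loc[engage, "net_value"].sum() / len(toy)
print(avg_net_value)
```

Changing `threshold` is how you would trade off engaging with more customers against avoiding the unprofitable ones.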
Thresholding works in lots of real case scenarios and it is particularly useful when the nature of the decision is binary. However, we can think of cases where things tend to be more nuanced. For example, you might be willing to spend more on marketing to get the attention of very profitable customers. Or you might want to add them to some prime customers list, where you give special treatment to them, but it also costs you more to do so. Notice that if we include these possibilities, your decision goes from binary (engage vs don’t engage) to continuous: how much should you invest in a customer.
Here, for the next example, suppose your decision is not just who to do business with, but how much marketing costs you should invest in each customer. And for the sake of the example, assume that you are competing with other firms and whoever spends more on marketing in a particular customer wins that customer (much like a bidding mechanism). In that case, it makes sense to invest more in highly profitable customers, less in marginally profitable customers and not at all in non profitable customers.
One way to do that is to discretize your predictions into bands. We’ve done this previously for the purpose of model comparison, but here we’ll do it for decision making. Let’s create 20 bands. We can think of those as quantiles or equal size groups. The first band will contain the 5% least profitable customers according to our predictions, the second band will contain those between the 5% and the 10% least profitable, and so on. The last band, 20, will contain the most profitable customers.
Notice that the binning too has to be estimated on the training set and applied on the test set! For this reason, we will compute the bins using `pd.qcut` on the training set. To actually do the binning, we will use `np.digitize`, passing the bins that were precomputed on the training set.
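def model_binner(prediction_column, bins):
    # find the bin edges according to the training set
    bands = pd.qcut(prediction_column, q=bins, retbins=True)[1]

    def binner_function(prediction_column):
        # np.digitize returns 0 below the first edge and len(bands) at or above the last one;
        # clip so that every prediction lands in a band from 1 to `bins`
        return np.clip(np.digitize(prediction_column, bands), 1, len(bands) - 1)

    return binner_function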
# train the binning function
binner_fn = model_binner(train_pred["predictions"], 20)
# apply the binning
model_band = model_policy.assign(bands = binner_fn(model_policy["prediction"]))
model_band.head()
| | customer_id | region | income | age | net_value | prediction | bands |
|---|---|---|---|---|---|---|---|
| 5952 | 5952 | 19 | 1983 | 23 | 21 | 47.734883 | 18 |
| 1783 | 1783 | 31 | 914 | 31 | -46 | -36.026935 | 7 |
| 4811 | 4811 | 33 | 1349 | 25 | -19 | 22.553420 | 15 |
| 145 | 145 | 20 | 1840 | 26 | 55 | 48.306256 | 18 |
| 7146 | 7146 | 19 | 3032 | 34 | -17 | 7.039414 | 13 |
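One detail worth knowing when bin edges are learned on one dataset and applied to another is what `np.digitize` does with values outside the range of those edges. The sketch below uses made-up scores from 1 to 100 and 4 bands, chosen only to keep the numbers easy to follow. Clipping is one possible way to handle out-of-range values. It is an assumption of this sketch, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# Made-up "training" scores: the numbers 1 through 100.
train_scores = pd.Series(np.arange(1.0, 101.0))

# Learn the quartile edges on the training scores (5 edges for 4 bands).
_, edges = pd.qcut(train_scores, q=4, retbins=True)

# New scores, including one below and one above the training range.
new_scores = np.array([0.0, 10.0, 60.0, 100.0, 150.0])

# Raw band index: 0 means below the first edge,
# len(edges) means at or above the last edge.
raw_bands = np.digitize(new_scores, edges)

# Force every score into one of the 4 learned bands.
clipped_bands = np.clip(raw_bands, 1, len(edges) - 1)

print(raw_bands)      # [0 1 3 5 5]
print(clipped_bands)  # [1 1 3 4 4]
```

Without the clipping step, the lowest and highest scores would end up in extra bands that were never part of the plan.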
plt.figure(figsize=(10,6))
sns.barplot(data=model_band, x="bands", y="net_value")
plt.title("Model Bands");
With these bands, we can allocate the bulk of our marketing investments to bands 20 and 19. Notice how we went from a binary decision (engage vs not engage), to a continuous one: how much to invest on marketing for each customer. Of course you can fine tune this even more, adding more bands. In the limit, you are not binning at all. Instead, you are using the raw prediction of the model and you can create decision rules like
mkt_investments_i = model_prediction_i * 0.3
where, for each customer $i$, you invest 30% of the net value predicted by the model.
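A rule like this is a one-liner in practice. The sketch below uses made-up predictions and adds one assumption that is not in the rule above: negative predictions are floored at zero, so that we invest nothing in customers we expect to be unprofitable. The names `predictions` and `investment_rate` are hypothetical.

```python
import numpy as np

# Made-up model predictions of net value for four customers.
predictions = np.array([47.7, -36.0, 22.6, 7.0])

# Fraction of the predicted net value we are willing to invest.
investment_rate = 0.3

# Floor predictions at zero so the investment is never negative,
# then invest a fixed share of what is left.
mkt_investments = np.clip(predictions, 0, None) * investment_rate
print(mkt_investments)
```

Here the decision is fully continuous: every customer gets their own investment level, driven directly by the model's prediction.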
Key Ideas#
We’ve covered A LOT of ground here in a very short time, so I think this recap is extremely relevant for us to see what we accomplished here. First, we learned how the majority of machine learning applications involve nothing more than making good predictions, where prediction is understood as mapping from a known input to an initially unknown, but well defined output. We can also understand prediction as estimating an expectation function $E[Y|X]$.
Then, we got back down to earth and looked at how good predictions can help us with more common tasks, like figuring out which customers we should bring in and which to avoid. Specifically, we looked at how we could predict customer profit. With that prediction, we built a policy that decides who we should do business with. Notice that this is just an example of where prediction models can be applied. There are surely tons of other applications, like credit card underwriting, fraud detection, cancer diagnostics and anything else where good predictions might be useful.
The key takeaway here is that if you can frame your business problem as a prediction problem, then machine learning is probably the right tool for the job. I really can’t emphasize this enough. With all the hype around machine learning, I feel that people forget about this very important point and often end up making models that are very good at predicting something totally useless. Instead of thinking about how to frame a business problem as a prediction problem and then solving it with machine learning, they often build a prediction model and try to see what business problem could benefit from that prediction. This might work, but, more often than not, is a shot in the dark that only generates solutions in search of a problem.
References#
The things I’ve written here are mostly stuff from my head. I’ve learned them through experience. This means there isn’t a direct reference I can point you to. It also means that the things I wrote here have not passed the academic scrutiny that good science often goes through. Instead, notice how I’m talking about things that work in practice, but I don’t spend too much time explaining why that is the case. It’s a sort of science from the streets, if you will. However, I am putting this up for public scrutiny, so, by all means, if you find something preposterous, open an issue and I’ll address it to the best of my efforts.
Finally, I believe I might have been too quick for those who were hoping for a comprehensive and detailed introduction to machine learning. To be honest, I believe that where I can truly generate value is teaching about causal inference, not machine learning. For the latter, there are tons of amazing online resources, much better than I could ever dream of creating. The classical one is Andrew Ng’s course on Machine Learning and I definitely recommend you take a look at it if you are new to machine learning.
Contribute#
Causal Inference for the Brave and True is an open-source material on causal inference, the statistics of science. It uses only free software, based in Python. Its goal is to be accessible monetarily and intellectually. If you found this book valuable and you want to support it, please go to Patreon. If you are not ready to contribute financially, you can also help by fixing typos, suggesting edits or giving feedback on passages you didn’t understand. Just go to the book’s repository and open an issue. Finally, if you liked this content, please share it with others who might find it useful and give it a star on GitHub.