Under the Hood of Uber’s Experimentation Platform
28 August 2018 / Global
![Featured image for Under the Hood of Uber’s Experimentation Platform](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2018/08/Header-1-1-1024x569.png)
Experimentation is at the core of how Uber improves the customer experience. Uber applies several experimental methodologies to use cases as diverse as testing out a new feature to enhancing our app design.
Uber’s Experimentation Platform (XP) plays an important role in this process, enabling us to launch, debug, measure, and monitor the effects of new ideas, product features, marketing campaigns, promotions, and even machine learning models. The platform supports experiments across our driver, rider, Uber Eats, and Uber Freight apps and is widely used to run A/B/N, causal inference, and multi-armed bandit (MAB)-based continuous experiments.
There are over 1,000 experiments running on our platform at any given time. For example, before Uber launched our new driver app, completely redesigned with our driver-partners in mind, the app went through extensive hypothesis testing in a series of experiments conducted with our XP.
At a high level, Uber’s XP allows engineers and data scientists to monitor treatment effects to ensure they do not cause regressions of any key metrics. The platform also lets users configure the universal holdout, used to measure the long-term effects of all experiments for a specific domain.
![Image](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2018/08/image15-1.png)
Figure 1. On average, more than 1,000 experiments are running on Uber's Experimentation Platform at any given time.
Below is a chart outlining the types of experimentation methodologies that the Experimentation Platform team uses:
![Image](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2018/08/image33.png)
Figure 2. Uber's Experimentation Platform conducts both randomized experiments and observational studies.
Various factors determine which statistical methodology we should apply to a given use case. Broadly, we use four types of statistical methodologies: fixed-horizon A/B/N tests (t-test, chi-squared, and rank-sum tests), sequential probability ratio tests (SPRT), causal inference tests (synthetic control and diff-in-diff tests), and continuous A/B/N tests using bandit algorithms (Thompson sampling, upper confidence bounds, and Bayesian optimization with contextual multi-armed bandit tests, to name a few). We also apply block bootstrap and delta methods to estimate standard errors, as well as regression-based methods for bias correction when calculating the probability of Type I and Type II errors in our statistical analyses.
In this article, we discuss how each of these statistical methods is used by Uber's Experimentation Platform to improve our services.
Classic A/B testing
Randomized A/B or A/B/N tests are considered the gold standard in many quantitative scientific fields for evaluating treatment effects. Uber applies this technique to make objective, data-driven, and scientifically rigorous product and business decisions. In essence, classic A/B testing enables us to randomly split users into control and treatment groups to compare the decision metrics between these groups and determine the experiment’s treatment effects.
![Image](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2018/08/image22-e1535387276318.png)
Figure 3. Uber's Experimentation Platform team conducts randomized experiments, leveraging A/B/N tests to determine lift.
A common use case for this methodology is feature release experiments. Suppose a product manager wants to evaluate whether a new feature increases user satisfaction with Uber’s platform. The product manager could use our XP to glean the following metrics: the average values of the metric in both treatment and control groups, the lift (treatment effect), whether the lift is significant, and whether the sample sizes are large enough to wield high statistical power.
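As an illustrative sketch (not Uber's production engine), the core fixed-horizon comparison of these metrics can be written in a few lines of Python. The example below computes the relative lift, a Welch-style t statistic, and a two-sided p-value, using a large-sample normal approximation in place of the exact t distribution:

```python
import math
from statistics import mean, variance

def ab_test_summary(control, treatment):
    """Welch-style two-sample comparison: returns relative lift,
    t statistic, and a two-sided p-value (large-sample normal
    approximation in place of the exact t distribution)."""
    m1, m2 = mean(control), mean(treatment)
    se = math.sqrt(variance(control) / len(control)
                   + variance(treatment) / len(treatment))
    t = (m2 - m1) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    lift = (m2 - m1) / m1
    return lift, t, p
```

With large sample sizes the normal approximation is close to the exact Welch test; a production implementation would also apply the Welch-Satterthwaite degrees-of-freedom correction.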
![Image](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2018/08/image22.jpg)
Figure 4. Our XP analytics dashboard makes it easy for data scientists and other users to access and interpret their A/B test results.
Statistics engine
One of our team’s main goals is to deliver one-size-fits-most methodologies of hypothesis testing that can be applied to use cases across the company. To accomplish this, we collaborated with multiple stakeholders to build a statistics engine.
When we analyze a randomized experiment, the first step is to pick a decision metric (e.g., rider gross bookings). This choice relates directly to the hypothesis being tested. Our XP enables experimenters to easily reuse pre-defined metrics and automatically handles data gathering and data validation. Depending on the metrics type, our statistics engine applies different statistical hypothesis testing procedures and generates easy-to-read reports. At Uber, we invest heavily in the research and validation of methodologies and are constantly improving the robustness and effectiveness of our statistics engine.
Figure 5, below, offers a high-level overview of this powerful tool:
![Image](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2018/08/image16.png)
Figure 5. Uber's statistics engine for A/B/N experiments, driven by fixed-horizon hypothesis testing methodologies.
Key components and statistical methodologies
After gathering data, our XP's analytics platform validates the data and detects two major issues that experimenters should watch for, maintaining a healthy skepticism about their A/B experiments:
- Sample size imbalance, meaning that the sample size ratio in the control and treatment groups is significantly different from what was expected. In these scenarios, experimenters must double check their randomization mechanisms.
- Flickers, which refers to users who have switched between the control and treatment groups. For example, suppose a rider purchases a new Android phone to replace an old iPhone while the treatment of the experiment was only configured for iOS; the rider would switch from the treatment group to the control group. The existence of such users might contaminate the experiment results, so we exclude these users (flickers) from our analyses.
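Both validation checks are simple to sketch. The illustrative Python below (not the platform's actual code) flags a sample-size imbalance with a two-sided z-test against the expected split and drops flickers before analysis:

```python
from math import erf, sqrt

def check_sample_imbalance(n_control, n_treatment, expected_ratio=0.5):
    """Two-sided z-test that the observed control share matches the
    expected split; a tiny p-value signals a broken randomizer."""
    n = n_control + n_treatment
    observed = n_control / n
    se = sqrt(expected_ratio * (1 - expected_ratio) / n)
    z = (observed - expected_ratio) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # p-value

def exclude_flickers(control_ids, treatment_ids):
    """Drop users observed in both groups before analysis."""
    flickers = set(control_ids) & set(treatment_ids)
    return ([u for u in control_ids if u not in flickers],
            [u for u in treatment_ids if u not in flickers])
```

A 6,000/4,000 split against an expected 50/50 yields an essentially zero p-value, which would prompt a check of the randomization mechanism.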
Most of our use cases are randomized experiments, and most of the time summarized data is sufficient for performing fixed-horizon A/B tests. At the user level, there are three distinct types of metrics:
- Continuous metrics contain one numeric value column, e.g., gross bookings per user.
- Proportion metrics contain one binary indicator value column, e.g., the proportion of users who complete any trips after sign-up.
- Ratio metrics contain two numeric value columns, the numerator values and the denominator values, e.g., the trip completion ratio, where the numerator values are the number of completed trips and the denominator values are the number of total trip requests.
Three variants of data preprocessing are applied to improve the robustness and effectiveness of our A/B analyses:
- Outlier detection removes irregularities in data and improves the robustness of analytic results. We use a clustering-based algorithm to perform outlier detection and removal.
- Variance reduction helps increase the statistical power of hypothesis testing, which is especially helpful when the experiment has a small user base or when we need to end the experiment prematurely without sacrificing scientific rigor. The CUPED method leverages pre-experiment information to reduce the variance in decision metrics.
- Pre-experiment bias correction addresses a big challenge at Uber given the diversity of our users: sometimes, constructing a robust counterfactual via randomization alone just doesn't cut it. Difference-in-differences (diff-in-diff) is a well-accepted method in quantitative research, and we use it to correct pre-experiment bias between groups so as to produce reliable treatment effect estimates.
The p-value calculation is central to our statistics engine. The p-value directly determines whether the XP reports a result as significant. In a common A/B test, we compare the p-value to our desired false positive rate (Type-I error) of 0.05. Our XP leverages various procedures for p-value calculation, including:
- Welch’s t-test, the default test used for continuous metrics, e.g., completed trips.
- The Mann-Whitney U test, a nonparametric rank-sum test used when the data exhibit severe skewness. It requires weaker assumptions than the t-test and performs better with skewed data.
- The Chi-squared test, used for proportion metrics, e.g., rider retention rate.
- The Delta method (Deng et al. 2011) and bootstrap methods, used for standard error estimation whenever suitable to generate robust results for experiments with ratio metrics or with small sample sizes, e.g., the ratio of trips cancelled by riders.
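For intuition, here is a hedged sketch of the delta-method standard error for a ratio metric, treating each user's (numerator, denominator) pair as one observation; the function name is illustrative, not the platform's API:

```python
from math import sqrt
from statistics import mean, variance

def ratio_metric_se(numer, denom):
    """Delta-method SE for r = mean(numerator) / mean(denominator),
    accounting for the covariance between the two columns."""
    n = len(numer)
    mu_n, mu_d = mean(numer), mean(denom)
    r = mu_n / mu_d
    cov_nd = sum((a - mu_n) * (b - mu_d)
                 for a, b in zip(numer, denom)) / (n - 1)
    var_r = (variance(numer) - 2 * r * cov_nd
             + r * r * variance(denom)) / (mu_d * mu_d)
    return r, sqrt(var_r / n)
```

When the denominator is constant, the formula collapses to the ordinary standard error of a scaled mean, which is a useful sanity check.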
On top of these calculations, we use multiple comparison correction (the Benjamini-Hochberg procedure) to control the overall false discovery rate (FDR) when there are two or more treatment groups (e.g., in an A/B/C test or an A/B/N test).
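The Benjamini-Hochberg procedure is straightforward to sketch. This illustrative version returns which of the treatment comparisons survive FDR control at level alpha:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: reject the hypotheses
    with the k smallest p-values, where k is the largest rank with
    p_(k) <= k * alpha / m, controlling FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            max_k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            reject[i] = True
    return reject
```

For example, with p-values [0.01, 0.04, 0.03, 0.5] at alpha = 0.05, only the first comparison is rejected, since 0.03 exceeds its step-up threshold of 0.025.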
The power calculation provides additional information about the level of confidence users should place in their analysis. An experiment with low power will suffer from a high false negative rate (Type-II error) and a high FDR. In the power calculations our XP conducts, a t-test is always assumed. Conversely, the required sample size calculation inverts the power calculation, estimating how many users the experiment needs in order to achieve high power (0.8).
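The required-sample-size arithmetic reduces to the standard normal-approximation formula; a sketch (per-group size for a two-sided, two-sample test at alpha = 0.05 and power 0.8):

```python
import math
from statistics import NormalDist

def required_sample_size(mde, sigma, alpha=0.05, power=0.8):
    """Per-group n for a two-sample, two-sided test:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / mde) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_beta = NormalDist().inv_cdf(power)           # ~0.84
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / mde) ** 2)
```

For a minimum detectable effect of half a standard deviation this gives the familiar answer of 63 users per group; halving the MDE roughly quadruples the requirement.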
Metrics management
As the number of metrics used by the XP's analytics component grows (incorporating 1,000+ metrics), it becomes more and more challenging for users to determine the proper metrics to evaluate the performance of an experiment. To make it easier for new users of our analytics tool to uncover these metrics, we built a recommendation engine that facilitates the discovery of metrics available on our platform.
At Uber, there are two common collaborative filtering methods used for content recommendation: item-based and user-based methods. We primarily use an item-based recommendation engine, since the characteristics of the experimenter do not typically have a strong influence on their project. For instance, if an experimenter switches from the Rider team to the Uber Eats team, it's not necessary for the algorithm to review that experimenter's previous, Rider-focused choices when selecting metrics to evaluate.
Recommendation engine methodology
To determine how correlated two metrics are to each other, we add their popularity and absolute scores, enabling us to better understand their relationship. The two basic approaches to calculating these scores are:
确定两个指标之间的相关性,我们将它们的受欢迎程度和绝对分数相加,从而更好地理解它们之间的关系。计算这些分数的两种基本方法是:
- Popularity score: The more frequently two metrics are selected together across experiments, the higher the score assigned to their relationship. We use the Jaccard Index to help users discover the most relevant metric once they select their initial metric. This score accounts for the experimenters’ metrics selection from past experiments.
- Absolute score: Using our XP, we can generate a pool of user samples from our metrics and calculate the Pearson correlation score of the two metrics. This accounts for serendipitous discovery; namely, the experimenter may not have considered adding a metric to the experiment since it is not directly related, but it might move together with the user-selected metric.
After calculating these two scores, we combine them with relative weights on each term and recommend the metrics with the highest total score to the experimenter, based on their first choice of metric.
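A toy version of the combined score might look like the following; the weights and helper names are illustrative, not the platform's actual values:

```python
def jaccard(experiments_a, experiments_b):
    """Popularity score: co-selection overlap of two metrics across
    the sets of past experiments that used each."""
    a, b = set(experiments_a), set(experiments_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pearson(xs, ys):
    """Absolute score: correlation of the two metrics over a pool of
    sampled users."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def metric_affinity(experiments_a, experiments_b, samples_a, samples_b,
                    w_pop=0.5, w_abs=0.5):
    """Weighted sum of the two scores; weights are illustrative."""
    return (w_pop * jaccard(experiments_a, experiments_b)
            + w_abs * abs(pearson(samples_a, samples_b)))
```

Metrics are then ranked by this affinity against the experimenter's first-chosen metric.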
Insights discovery
As Uber continues to scale, it becomes more and more challenging to mine our metrics knowledge base. Our recommendation engine enables both global and local teams to access the information they need quickly and easily, allowing them to improve our services accordingly.
For example, if an experimenter wants to measure the treatment effect on driver-partner supply hours, it may not be obvious to the experimenter to also add the number of trips taken by new riders as a metric, since this experiment focuses on the driver side of the trip equation. However, both metrics are important for this experiment because of the dynamics of our marketplace. Our recommendation engine helps data scientists and other users discover important metrics that may not have been obvious.
Sequential testing
While traditional A/B testing methods (for example, a t-test) inflate Type-I error when results are repeatedly checked on accumulating subsamples, sequential testing offers a way to continuously monitor key business metrics.
One use case where a sequential test comes in handy for our team is identifying outages caused by the experiments running on our platform. We cannot wait until a traditional A/B test collects sufficient sample sizes to determine the cause of an outage; we want to make sure, as soon as possible, that experiments are not introducing degradations of key business metrics during the experimentation period. Therefore, we built a monitoring system powered by a sequential testing algorithm that adjusts confidence intervals accordingly without inflating Type-I error.
Using our XP, we conduct periodic comparisons of business metrics, such as app crash rates and trip frequency rates, between treatment and control groups for ongoing experiments. Experiments continue if there are no significant degradations; otherwise, they trigger an alert or are even paused. The workflow for this monitoring system is shown in Figure 6, below:
![Image](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2018/08/image29.png)
Figure 6. We incorporate sequential testing methodologies into the workflow of our XP's outage monitoring system.
Methodologies
We leverage two main methodologies to perform sequential testing for metrics monitoring: the mixture sequential probability ratio test (mSPRT) and variance estimation with FDR control.
Mixture Sequential Probability Ratio Test
The most common method we use for monitoring is mSPRT. This test builds on the likelihood ratio test by incorporating an extra specification of a mixing distribution $H$. Suppose we are testing the metric difference $\theta$ with the null hypothesis $H_0: \theta = \theta_0$; then the test statistic can be written as

$$\Lambda_n^{H,\theta_0} = \int_\Theta \prod_{i=1}^{n} \frac{f_\theta(x_i)}{f_{\theta_0}(x_i)}\, dH(\theta).$$

Since we have large sample sizes and the central limit theorem can be applied in most cases, we use a normal distribution as our mixing distribution, $H = N(\theta_0, \tau^2)$. This leads to easy computation and a closed-form expression for $\Lambda_n^{H,\theta_0}$. Another useful property of this method is that under the null hypothesis, $\Lambda_n^{H,\theta_0}$ is proven to be a martingale: $\mathbb{E}\big[\Lambda_{n+1}^{H,\theta_0} \mid \Lambda_1^{H,\theta_0}, \ldots, \Lambda_n^{H,\theta_0}\big] = \Lambda_n^{H,\theta_0}$. Following this, we can construct an always-valid $(1-\alpha)$ confidence interval.
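Under a normal data model with known variance, the mixture integral has a closed form. The sketch below is illustrative (known variance, made-up parameter values): it computes the mSPRT statistic and flags the first time it crosses the always-valid rejection threshold 1/alpha:

```python
from math import exp, sqrt

def msprt_statistic(xs, theta0=0.0, sigma2=1.0, tau2=1.0):
    """Closed-form mSPRT statistic for i.i.d. N(theta, sigma2)
    observations with a N(theta0, tau2) mixing distribution:
    sqrt(s2/(s2+n*t2)) * exp(n^2*t2*(xbar-theta0)^2 / (2*s2*(s2+n*t2)))."""
    n = len(xs)
    xbar = sum(xs) / n
    return sqrt(sigma2 / (sigma2 + n * tau2)) * exp(
        n * n * tau2 * (xbar - theta0) ** 2
        / (2 * sigma2 * (sigma2 + n * tau2)))

def monitor(xs, alpha=0.05, **kw):
    """Flag a regression the first time Lambda_n crosses 1/alpha;
    by the martingale property this controls Type-I error at alpha
    no matter how often we peek."""
    for n in range(1, len(xs) + 1):
        if msprt_statistic(xs[:n], **kw) >= 1 / alpha:
            return n  # first detection time
    return None
```

A metric whose daily differences hover around zero never triggers the alert, while a persistent shift is caught after only a few observations.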
Variance estimation with FDR control
To apply sequential testing correctly, we need to estimate variance as accurately as possible. Since we monitor the cumulative difference between our control and treatment groups on a daily basis, observations from the same users introduce correlations that violate the assumptions of the mSPRT test. For example, if we are monitoring click-through rates, the metric from one user across multiple days may be correlated. To overcome this, we use delete-a-group jackknife variance estimation and block bootstrap methods to generalize the mSPRT test to correlated data.
Since our monitoring system aims to evaluate the overall health of an ongoing experiment, we monitor many business metrics at the same time, potentially leading to false alarms. In theory, either the Bonferroni or the BH correction could be applied in this scenario. However, since the potential loss from missing business degradations can be substantial, we apply the BH correction here and also tune parameters (MDE, power, tolerance for practical significance, etc.) for metrics with varying levels of importance and sensitivity.
Use cases
Suppose we want to monitor a key business metric for a specific experiment, as depicted in Figure 7, below:
Figure 7. The sequential test methodology indicates a significant difference between our treatment and control groups, as identified in Plot B. In contrast, no significant difference is identified in Plot A.
The red lines in Plots A and B signify the observed cumulative relative difference between our treatment and control groups. The red band is the confidence interval for this cumulative relative difference.
As time passes, we accumulate more samples and the confidence interval narrows. In Plot B, the confidence interval consistently deviates from zero starting on a given date, in this example, November 21. With an extra threshold for practical significance imposed (in other words, a tolerance for our monitoring system), the metric degradation is detected to be both statistically and practically significant after a certain date. In contrast, Plot A's confidence interval shrinks but always includes zero; thus, we did not detect any regression for the crash rate monitored in Plot A.
Continuous experiments
![Image](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2018/08/image22-e1535387228409.png)
To accelerate innovation and learning, the data science team at Uber is always looking to optimize driver, rider, eater, restaurant, and delivery-partner experiences through continuous experiments. Our team has implemented bandit and optimization-focused reinforcement learning methods to learn iteratively and rapidly from the continuous evaluation of related metric performance.
Recently, we completed an experiment that used bandit techniques for content optimization, which improved customer engagement compared to classic hypothesis testing methods. Figure 9, below, outlines Uber's various continuous experiment use cases, including content optimization, hyperparameter tuning, spend optimization, and automated feature rollouts:
![Image](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2018/08/Conference-Deck-Improving-Customer-Experience-V4-ML-Revised-e1535392042354.jpg)
Figure 9. Uber's XP leverages continuous experiments for a variety of use cases, including hyperparameter tuning and automated feature rollouts.
In Case Study 1, we outline how bandits have helped optimize email campaigns and enhance rider engagement at Uber. Here, the Uber Eats Customer Relationship Management (CRM) team in Europe, the Middle East, and Africa (EMEA) launched an email campaign to encourage order momentum early in the customer life cycle. The experimenters planned to run a campaign with ten different email subject lines and identify the best subject line in terms of the open rate and the number of opened emails. Figure 10, below, details this case study:
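Case Study 1's subject-line selection maps naturally onto Beta-Bernoulli Thompson sampling. The simulation below is purely illustrative (three hypothetical subject lines with made-up open rates, not the campaign's data):

```python
import random

def thompson_pick(successes, failures):
    """Draw a plausible open rate for each subject line from its
    Beta(successes+1, failures+1) posterior; send the best draw."""
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

def simulate_campaign(true_open_rates, n_sends=5000):
    """Route each send to the arm Thompson sampling picks and record
    whether the (simulated) email was opened."""
    s = [0] * len(true_open_rates)
    f = [0] * len(true_open_rates)
    for _ in range(n_sends):
        arm = thompson_pick(s, f)
        if random.random() < true_open_rates[arm]:
            s[arm] += 1  # email opened
        else:
            f[arm] += 1
    return s, f
```

Unlike a fixed-horizon A/B/N test that splits traffic evenly for the whole campaign, the bandit shifts most sends to the best-performing subject line as evidence accumulates.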
A second example of how we leverage continuous experiments is parameter tuning. Unlike the first case, the second case study uses a more advanced bandit algorithm, the contextual multi-armed bandit technique, which combines statistical experiments and machine learning modeling. We use contextual MAB to choose the best parameters in a machine learning model.
As depicted in Figure 11, below, the Uber Eats Data Science team leveraged MAB testing to create a linear programming model, called the multiple-objective optimization (MOO), that ranks restaurants on the main feed of the Uber Eats app:
The algorithm behind MOO incorporates several metrics, such as session conversion rate, gross booking fee, and user retention rate. However, the mathematical solution contains a set of parameters that we need to give to the algorithm.
These experiments involve many parameter candidates for our ranking algorithms, and the ranking results depend on the hyperparameters we choose for the MOO model. To improve the model's performance, we want to find the best hyperparameters; the traditional A/B test framework is too time-intensive to evaluate each candidate, so we utilize the MAB method, which provides a framework to tune these parameters quickly.
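As a stand-in for the production tuner, a discrete UCB1 loop over candidate hyperparameter settings shows the mechanics; the candidate values and reward function here are made up, and Uber's actual approach (contextual MAB with Bayesian optimization) is more sophisticated:

```python
import math
import random

def ucb1_tune(candidates, reward_fn, rounds=2000):
    """UCB1 over a discrete grid of hyperparameter candidates;
    reward_fn returns a noisy 0/1 outcome such as a session
    conversion."""
    k = len(candidates)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, rounds + 1):
        if t <= k:
            i = t - 1  # play every candidate once first
        else:
            # exploit the best mean plus an exploration bonus
            i = max(range(k), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
        sums[i] += reward_fn(candidates[i])
        counts[i] += 1
    best = max(range(k), key=lambda j: sums[j] / counts[j])
    return candidates[best], counts
```

The exploration bonus shrinks as a candidate accumulates plays, so traffic concentrates on the best setting without ever fully abandoning the others.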
We chose the contextual MAB and Bayesian optimization methods to find the maximizers of a black-box function optimization problem. Figure 12, below, outlines the setup of this experiment:
![Image](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2018/08/image18.png)
As shown above, contextual Bayesian optimization works well with both personalized information and exploration-exploitation trade-offs.
Moving Forward
As a result of its scale and global impact, Uber’s problem space poses unique challenges. As our methodologies evolve, we aspire to build an ever more intelligent experimentation platform. In the future, this platform will provide insights gleaned not only from current experiments, but also previous ones, and, over time, proactively predict metrics.
Uber’s Experimentation Platform team is hiring. If you are passionate about experimentation and machine learning, please apply for this role.
Subscribe to our newsletter to keep up with the latest innovations from Uber Engineering.
![Anirban Deb](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2022/08/Anirban-Deb.png)
Anirban Deb
Anirban Deb is the former tech lead of the Experimentation, Segmentation, Personalization and Mobile App Development Platform data science teams at Uber and currently heading the Uber Freight data science organization.
![Suman Bhattacharya](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2022/08/bg_4d525412-1132-478b-9f61-17b94ad4da5b-dO27SVn4uQ.jpg)
Suman Bhattacharya
Suman Bhattacharya is a senior data scientist on Uber’s Experimentation Platform team.
![Jeremy Gu](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2022/08/jeremy-gu.png)
Jeremy Gu
Jeremy Gu is a data scientist on Uber’s Experimentation Platform team.
![Tianxia Zhou](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2022/08/up_0e40520b-3048-4204-8625-cd39e8a4dcce.jpg)
Tianxia Zhou
Tianxia Zhou is a data scientist on Uber's Experimentation Platform team.
![Eva Feng](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2022/08/Eva-Feng.png)
Eva Feng
Eva Feng is a data scientist on Uber's Experimentation Platform team.
![Mandie Liu](https://blog.uber-cdn.com/cdn-cgi/image/width=2160,quality=80,onerror=redirect,format=auto/wp-content/uploads/2022/08/up_949065af-cdd7-480e-b63c-f522fe4ac283.jpg)
Mandie Liu
Mandie Liu is a data scientist on Uber’s Experimentation Platform team.
Posted by Anirban Deb, Suman Bhattacharya, Jeremy Gu, Tianxia Zhou, Eva Feng, Mandie Liu