Introduction
Learn how to approach Machine Learning System Design.
1. What should you expect in a machine learning interview?
- Most major companies, such as Facebook, LinkedIn, Google, Amazon, and Snapchat, expect Machine Learning engineers to have solid engineering foundations and hands-on Machine Learning experience. This is why interviews for Machine Learning positions share components with interviews for traditional software engineering positions. Candidates typically go through rounds on problem solving (Leetcode style), system design, machine learning knowledge, and machine learning system design.
- The standard development cycle of machine learning includes data collection, problem formulation, model creation, model implementation, and model improvement. Throughout the interview, it is in the company’s best interest to gather as much information as possible about an applicant’s competence in these areas. There are plenty of resources on how to train machine learning models and how to deploy them with different tools. However, there are no common guidelines for approaching machine learning system design end to end. This was one major reason for designing this course.
2. How will this course help you?
In this course, we will learn how to approach machine learning system design from a top-down view. It’s important for candidates to realize the challenges early on and address them at a structural level. Here is one example of the thinking flow.
Problem statement
It’s important to state the correct problem. It is the candidate’s job to understand the intention of the design and what it is being optimized for. It’s important to make the right assumptions and discuss them explicitly with the interviewer.
For example, in a LinkedIn feed design interview, the interviewer might ask broad questions:
Design LinkedIn Feed Ranking.
Asking questions is crucial to filling in any gaps and agreeing on goals. The candidate should begin by asking follow-up questions to clarify the problem statement. For example:
- Is the output of the feed in chronological order?
- How do we want to balance feed posts versus sponsored ads, etc.?
Once we are clear on the problem statement of designing a Feed Ranking system, we can start talking about relevant metrics like user engagement.
Identify metrics
During the development phase, we need to quickly test model performance using offline metrics. You can start with popular metrics like logloss and AUC for binary classification, or RMSE and MAPE for forecasting.
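As a rough illustration of the four offline metrics named above, here is a minimal sketch using only the Python standard library. The function names and the toy inputs are assumptions made for this example; in practice a library such as scikit-learn would normally be used instead.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0/1)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise version is O(P * N), which is
    fine for a sketch but too slow for large datasets.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Assumes no true value is 0."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print("logloss:", log_loss(labels, scores))
    print("AUC:", auc(labels, scores))
    print("RMSE:", rmse([3.0, 5.0], [1.0, 5.0]))
    print("MAPE:", mape([100.0, 200.0], [110.0, 180.0]))
```

Logloss penalizes miscalibrated probabilities, while AUC only depends on the ordering of the scores, so the two can disagree about which model is better.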
Identify requirements
- Training requirements
- There are many components required to train a model end to end. These include data collection, feature engineering, feature selection, and the loss function.
For example, if we want to design a YouTube video recommendation model, users naturally don’t watch most of the videos recommended to them. Because of this, we have a lot of negative examples. This raises the question:
How do we train models to handle class imbalance?
Once we deploy models in production, we will have feedback in real time.
How do we monitor and make sure models don’t go stale?
- Inference requirements
Once models are deployed, we want to run inference with low latency (<100ms) and scale our system to serve millions of users.
How do we design inference components to provide high availability and low latency?
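For the class-imbalance question raised under the training requirements, one common option (an assumption here, not necessarily the approach this course takes) is to downsample negative examples before training and then correct the predicted probabilities afterwards. The sketch below uses hypothetical helper names and toy data.

```python
import random


def downsample_negatives(examples, keep_rate, seed=0):
    """Keep every positive example and a random fraction of the negatives.

    `examples` is a list of (features, label) pairs with label in {0, 1}.
    """
    rng = random.Random(seed)
    return [(x, y) for x, y in examples if y == 1 or rng.random() < keep_rate]


def recalibrate(p, keep_rate):
    """Map a probability predicted on downsampled data back to the original scale.

    Downsampling negatives inflates predicted positive rates; this standard
    correction undoes that shift.
    """
    return p / (p + (1.0 - p) / keep_rate)


if __name__ == "__main__":
    # Toy data: 10 positives out of 1,000 examples (1% positive rate).
    data = [({"id": i}, 1 if i < 10 else 0) for i in range(1000)]
    sampled = downsample_negatives(data, keep_rate=0.1)
    positives = sum(y for _, y in sampled)
    print(f"kept {len(sampled)} examples, {positives} positives")
    print("recalibrated 0.5 ->", recalibrate(0.5, keep_rate=0.1))
```

Other options include weighting the loss by class or changing the decision threshold; which one fits best depends on the data volume and on whether calibrated probabilities are needed downstream.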
Train and evaluate model
- There are usually three components: feature engineering, feature selection, and models. We will use modern techniques for each component.
- For example, in Rental Search Ranking, we will discuss whether we should use ListingID as an embedding feature. In Estimate Food Delivery Time, we will discuss how to handle the latitude and longitude features efficiently.
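The text does not say how latitude and longitude are handled, so the following is only one possible sketch: bucketing coordinates into a fixed grid and turning each cell into a bounded categorical index. The cell size, the bucket count, and the function names are all assumptions made for illustration.

```python
import math


def grid_cell(lat, lon, cell_deg=0.01):
    """Map a coordinate to a (row, col) grid cell of size cell_deg degrees."""
    row = math.floor((lat + 90.0) / cell_deg)
    col = math.floor((lon + 180.0) / cell_deg)
    return row, col


def cell_feature_index(lat, lon, cell_deg=0.01, num_buckets=100_000):
    """Flatten the grid cell to one integer and fold it into num_buckets.

    The result can index an embedding table or a one-hot feature. Distinct
    cells may collide after the modulo, which is the usual trade-off of
    bounding the vocabulary size.
    """
    row, col = grid_cell(lat, lon, cell_deg)
    cols_per_row = math.ceil(360.0 / cell_deg)
    return (row * cols_per_row + col) % num_buckets


if __name__ == "__main__":
    # Two nearby points and one distant point (coordinates are illustrative).
    print(grid_cell(37.7749, -122.4194))
    print(grid_cell(37.7745, -122.4196))
    print(grid_cell(40.7128, -74.0060))
    print(cell_feature_index(37.7749, -122.4194))
```

Treating raw coordinates as two independent numeric features makes it hard for many models to learn location-specific effects; a discretized cell lets the model learn one parameter set per area, at the cost of choosing a grid resolution.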
Design high-level system
In this stage, we need to think about the system components and how data flows through each of them. The goal of this section is to identify a minimal, viable design to demonstrate a working system. We need to explain why we decided to have these components and what their roles are.
- For example, when designing Video Recommendation systems, we would need two separate components: the Video Candidate Generation Service and the Ranking Model Service.
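The two-component split mentioned above can be sketched as two plain functions. Everything in this example (the in-memory catalog, the topic-overlap recall rule, and the scoring formula) is a hypothetical stand-in for real services and a trained model.

```python
# Hypothetical in-memory catalog; a real system would query an index or a service.
CATALOG = {
    "v1": {"topics": {"cooking"}, "popularity": 0.9},
    "v2": {"topics": {"cooking", "travel"}, "popularity": 0.4},
    "v3": {"topics": {"gaming"}, "popularity": 0.99},
    "v4": {"topics": {"travel"}, "popularity": 0.7},
}


def generate_candidates(user_topics, catalog, limit=100):
    """Candidate generation: cheap, recall-oriented filter over the whole catalog."""
    matches = [vid for vid, meta in catalog.items() if meta["topics"] & user_topics]
    return matches[:limit]


def rank(candidates, user_topics, catalog, top_k=10):
    """Ranking: a more expensive score applied only to the shortlisted candidates.

    The score is a toy stand-in for a learned ranking model.
    """
    def score(vid):
        meta = catalog[vid]
        return len(meta["topics"] & user_topics) * meta["popularity"]

    return sorted(candidates, key=score, reverse=True)[:top_k]


def recommend(user_topics, catalog, top_k=10):
    """Chain the two stages: shortlist first, then order the shortlist."""
    return rank(generate_candidates(user_topics, catalog), user_topics, catalog, top_k)


if __name__ == "__main__":
    print(recommend({"cooking", "travel"}, CATALOG, top_k=2))
```

Separating the stages lets the expensive ranking model run on a small shortlist rather than the full catalog, and lets each component be scaled and replaced independently.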
Scale the design
In this stage, it’s crucial to understand system bottlenecks and how to address these bottlenecks. You can start by identifying:
- Which components are likely to be overloaded?
- How can we scale the overloaded components?
- Is the system good enough to serve millions of users?
- How would we handle some components becoming unavailable?
- You can also learn more about how companies scale their designs here.