Tesla AI Day

Deep Understanding Tesla FSD Part 1: HydraNet

From Theory to Reality: Analyzing the Evolution of Tesla Full Self-Driving

11 min read · Oct 18, 2021

From Tesla AI Day

Almost a month ago, Tesla hosted Tesla AI Day. At this event, Tesla presented its AI and Autopilot work in full detail for the first time.

If you are an AI practitioner, especially one focused on autonomous driving, you should study the first part of Tesla AI Day. In the few weeks after the event, I reviewed the video “frame by frame”, searched for, downloaded, and read all the papers mentioned in it, and took a lot of notes. Gradually, I outlined the architecture of Tesla’s FSD.

Next, I will try to explore how Tesla fulfilled its promise of artificial intelligence & autopilot from the perspective of a software engineer.

Before starting, please think about a question with me. If you were the Sr. Director of Tesla AI and led the AI team, how would you achieve autonomous driving?

Cameras, Lidars, Machine Learning, Neural Networks, Maps, HD Maps, Papers, Labels, Training, Testing, Datasets, Planning, Security, Chips, CPUs, GPUs, Mass Data Training, the ethics of AI… all these things suddenly flooded my brain. My conclusion is that this is a mission impossible for me.

Let’s take a look at Tesla’s solution.

  1. How Do We Make A Car Autonomous?
  2. How Do We Generate Training Data?
  3. How Do We Run It In The Car?
  4. How Do We Iterate Quickly?

At AI Day, Andrej Karpathy, the Sr. Director of Tesla AI, and his colleagues Ashok Elluswamy and Milan Kovac presented their answers to these four questions.

How Do We Make A Car Autonomous?

Basic Capacity: Vision

First, look at the clip below, which shows the final result of Tesla Vision in the current version. Neural networks turn the video from the 8 cameras around the vehicle (left) into a 3-dimensional “Vector Space” (right) that represents everything you need for driving: lines, edges, curbs, traffic signs, traffic lights, and cars, along with the positions, orientations, depths, and velocities of those cars.

From: Tesla

The original design was inspired by the study of human or animal vision.

Figure 1. From Tesla

The three small images in the picture above show how the human and primate (macaque, right image) cerebral cortex processes vision. After the information hits the retina, it passes through a number of areas, streams, and layers of the cerebral cortex and finally forms biological vision. These areas and organs include the optic chiasm, the lateral geniculate nucleus (LGN), the primary visual cortex (V1), the extrastriate cortex (V2, V3, V4…), the inferior temporal area, and so on.

What Tesla has to do is build a vision-based neural network system like the human brain: to design the visual cortex of the car through software, hardware, and algorithms.

The input to Tesla Vision is raw-format video (the “digital negatives”) from its eyes: 8 cameras, each producing 1280x960 12-bit (HDR) frames at 36 Hz. You may have noticed that there are no other sensors, such as lidar or mmWave radar, besides the cameras. Later, Andrej explains and justifies why Tesla decided to use only the 8 cameras.
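
As a rough back-of-envelope check on how much data these cameras produce, the sketch below multiplies out the figures quoted above. It assumes uncompressed frames with exactly 12 bits per pixel and no packing or protocol overhead; that simplification is mine, not something stated at AI Day.

```python
# Raw camera bandwidth from the figures quoted in the text.
# Assumption (mine): uncompressed frames, exactly 12 bits per pixel, no overhead.
NUM_CAMERAS = 8
WIDTH, HEIGHT = 1280, 960
BITS_PER_PIXEL = 12
FRAMES_PER_SECOND = 36


def raw_bits_per_second(cameras=NUM_CAMERAS):
    """Bits per second produced by `cameras` cameras streaming raw frames."""
    pixels_per_frame = WIDTH * HEIGHT
    return cameras * pixels_per_frame * BITS_PER_PIXEL * FRAMES_PER_SECOND


total_bits = raw_bits_per_second()
total_megabytes = total_bits / 8 / 1e6  # bits -> bytes -> megabytes
print(f"{total_megabytes:.1f} MB/s across {NUM_CAMERAS} cameras")  # 530.8 MB/s
```

Under these assumptions the network has to digest roughly half a gigabyte of raw pixels every second, which is one reason an efficient backbone matters.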

Terms:

Here are some terms used in object detection tasks.

Backbone: The feature-extracting network, which is used to recognize several objects in a single image and provides rich feature information about them. AlexNet, ResNet, and VGGNet are often used as backbone networks.

Detection Head (head): The feature extractor (backbone) gives us a feature-map representation of the input. For actual tasks such as object detection or segmentation, we usually apply a “detection head” to the feature maps, so it is like a head attached to the backbone.

Neck: The neck sits between the backbone and the head and is used to extract more elaborate features (e.g., feature pyramid network (FPN), BiFPN).

There is a general structure for object detection:

Input → backbone → neck → head → Output

Figure 2. Object Detection Structure
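
The structure above can be read as plain function composition. The following is a minimal sketch with toy stand-ins; the function names and the list-based "features" are hypothetical and only make the data flow visible, they are not a real detector.

```python
from typing import Callable, List

Features = List[List[float]]  # one list of numbers per pyramid level


def detect(image: List[float],
           backbone: Callable[[List[float]], Features],
           neck: Callable[[Features], Features],
           head: Callable[[Features], List[str]]) -> List[str]:
    """Input -> backbone -> neck -> head -> Output."""
    return head(neck(backbone(image)))


# Toy stand-ins (hypothetical), just to show what each part receives.
def toy_backbone(image):
    # Two "scales": the full signal and a 2x-downsampled copy.
    return [image, image[::2]]


def toy_neck(features):
    # Pass-through; a real neck (FPN, BiFPN) would fuse the levels.
    return features


def toy_head(features):
    return [f"level {i}: {len(f)} values" for i, f in enumerate(features)]


result = detect([0.1, 0.5, 0.9, 0.3], toy_backbone, toy_neck, toy_head)
print(result)  # ['level 0: 4 values', 'level 1: 2 values']
```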

In Tesla’s neural network architecture:

backbone: RegNet + ResNet

neck: BiFPN

head: HydraNet

Figure 3. From Tesla

Next, I will try to explain why Tesla AI chose such an architecture.

Neural Network Backbone

Figure 4. Backbone, From Tesla

Initially, object detection used manually designed networks such as AlexNet [13], VGG [26], ResNet [8], and DenseNet [11] as the backbone. Later, as the scale of the data and the depth of networks increased, researchers began to consider semi-automated and automated network design instead of manual design. Well-known paradigms at this stage are AutoML and NAS (Neural Architecture Search).

Despite the effectiveness of AutoML and NAS, they have limitations: 1) high resource consumption, 2) poor flexibility, 3) poor generalization, 4) design results are hard to understand.

Tesla uses RegNet (regular network structures), built from residual neural network blocks, as its neural network backbone.

RegNet is a new network design paradigm presented in the 2020 Facebook AI Research (FAIR) paper Designing Network Design Spaces.

Instead of focusing on designing individual network instances (as NAS does), this paper designs network design spaces that parameterize populations of networks. That means exploring network structure (e.g., width, depth, groups) within standard model families such as VGG, ResNet, and ResNeXt. The result is a low-dimensional design space consisting of simple “regular” networks: RegNet.

Andrej also gave the reason why Tesla AI uses RegNet:

  1. It offers a very nice design space.
  2. It allows a good trade-off between latency and accuracy.


Simply put, the paper first designs an initial, unconstrained design space, AnyNet, and then uses the standard residual bottleneck block to form AnyNetX.

Figure 5.

The image above (upper right in Figure 4) shows the general network structure of AnyNet. The network is divided into three parts:

Stem: A convolution (kernel_size = 3, stride = 2, w0 = 32 output channels) that processes the input image and reduces its resolution by one-half.

Body: Performs the bulk of the computation. The network body is composed of a sequence of stages that operate at progressively reduced resolution r_i. Each stage consists of a sequence of identical blocks.

Head: Predicts n output classes.

Stage: All convolutional layers producing output maps of the same size are in the same network stage. The feature maps of different stages are used to form the feature pyramid network. (explained later in the article)
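
To make the "progressively reduced resolution" concrete, here is a minimal sketch that tracks the spatial size through the stem and the stages. It assumes that every stage halves the resolution (a stride-two first block, as in the RegNet paper's standard layout) and uses a hypothetical 224-pixel input rather than Tesla's camera resolution.

```python
def stage_resolutions(input_size, num_stages=4, stem_stride=2, stage_stride=2):
    """Return (resolution after the stem, [resolution after each stage]).

    Assumption: each stage reduces resolution by `stage_stride` exactly once.
    """
    after_stem = input_size // stem_stride
    resolutions = []
    r = after_stem
    for _ in range(num_stages):
        r //= stage_stride
        resolutions.append(r)
    return after_stem, resolutions


stem_out, stages = stage_resolutions(224)
print(stem_out, stages)  # 112 [56, 28, 14, 7]
```

Each entry in the list corresponds to one stage, and therefore to one candidate level of the feature pyramid.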

Figure 6.

The image above (bottom right in Figure 4) shows the two types of X block. The X block is based on the standard residual bottleneck with group convolution (see the paper Aggregated Residual Transformations for Deep Neural Networks). Panel (a) shows the basic block: each X block consists of a 1x1 conv, a 3x3 group conv, and a final 1x1 conv, where the 1x1 convs alter the channel width. Panel (b) shows the stride-two (s = 2) version.

The design process focuses mainly on the body of the network. The AnyNetX design space has 16 degrees of freedom, because each network consists of 4 stages and each stage has 4 parameters: the number of blocks d_i, the block width w_i, the bottleneck ratio b_i, and the group width g_i.

Using the error empirical distribution function (EDF) analysis defined in the paper, AnyNetX is simplified step by step until it becomes a design space with 6 degrees of freedom: RegNet. The 6 parameters are d (network depth), w0 (initial width, in output channels), wa (slope), wm (width multiplier), b (bottleneck ratio), and g (group convolution width).
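
To show how a handful of parameters can describe a whole network, here is a minimal sketch of the width rule from the RegNet paper: block j gets a linear target width u_j = w0 + wa * j, which is then snapped to w0 * wm^k for an integer k. Rounding widths to a multiple of 8 follows the public reference implementation as I understand it, and the example values are the RegNetX-200MF settings as I recall them; treat both as assumptions. The parameters b and g are left out because they do not affect the widths in this simplified version.

```python
import math


def regnet_stages(d, w0, wa, wm, q=8):
    """Return [(stage width, number of blocks), ...] for a RegNet.

    u_j = w0 + wa * j is the linear target width of block j. Each u_j is
    snapped to w0 * wm**k for an integer k, then rounded to a multiple of q.
    Consecutive blocks with the same width form one stage.
    """
    widths = []
    for j in range(d):
        u = w0 + wa * j
        k = round(math.log(u / w0) / math.log(wm))
        widths.append(int(round(w0 * wm ** k / q) * q))

    stages = []
    for w in widths:
        if stages and stages[-1][0] == w:
            stages[-1][1] += 1          # same width: extend the current stage
        else:
            stages.append([w, 1])       # new width: start a new stage
    return [tuple(s) for s in stages]


# Illustrative parameters (assumed RegNetX-200MF settings).
print(regnet_stages(d=13, w0=24, wa=36.44, wm=2.49))
# [(24, 1), (56, 1), (152, 4), (368, 7)]
```

Four numbers are enough here to produce four stages with their widths and depths, which is what makes the design space "regular" and easy to search.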

After backbone processing, RegNet outputs a number of features at different scales and resolutions. In this feature-extracting network, the very bottom has very high resolution with very low channel counts, and the top has low resolution with high channel counts. So the neurons at the bottom scrutinize the details of the image, while the neurons at the top understand the scene context (semantic information). These features of different scales and resolutions then enter the next processing step: the feature pyramid network.

Feature pyramid networks (Neck)

Early object detection algorithms usually attached the detection head directly to the feature map of the last layer of the last backbone stage. In object detection, the shallow layers (the bottom of the network) have high resolution, which helps with learning image details; the deep layers (the top of the network) have low resolution, which is good for semantic learning. In practice, it is difficult to identify objects of different scales effectively on a single feature map.

Therefore, the feature maps of different stages are combined into a feature pyramid network to characterize objects of different scales, and object detection is then performed on the feature pyramid.

Figure 7 Source

The evolution of FPN: Model (a) is a traditional featurized image pyramid, which uses an image pyramid to build a feature pyramid. Features are computed on each image scale independently, which is very slow. Model (b) runs a deep convolutional network (ConvNet) and predicts from a single feature map, which carries higher-level semantics. Model (c) is the approach of the Single Shot Detector (SSD), which reuses the multi-scale feature maps from different layers for prediction, but the maps from the lower layers are semantically weak. Model (d) is an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections. This architecture borrows from the detection strategy of SSD and the “shortcut connections” in ResNet.

For details, please refer to the paper: Feature Pyramid Networks for Object Detection(FPN) by Facebook AI Research (FAIR)
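
The top-down pathway with lateral connections of model (d) can be sketched on 1-D toy feature maps. This is a simplification under stated assumptions: real FPNs work on 2-D multi-channel maps, apply a 1x1 convolution on each lateral connection and a 3x3 convolution after each merge, and both are omitted here. The names c3, c4, c5 are just labels for three backbone levels.

```python
def upsample2x(feature):
    """Nearest-neighbour upsampling of a 1-D feature map by a factor of 2."""
    return [v for v in feature for _ in range(2)]


def top_down_fuse(levels):
    """Fuse a fine-to-coarse list of 1-D maps, FPN style.

    Each map must be half the length of the previous one. The coarsest map
    is used as-is; every finer map is added to the upsampled fused map above it.
    """
    fused = [None] * len(levels)
    fused[-1] = list(levels[-1])
    for i in range(len(levels) - 2, -1, -1):
        up = upsample2x(fused[i + 1])                      # top-down pathway
        fused[i] = [a + b for a, b in zip(levels[i], up)]  # lateral connection
    return fused


c3 = [1, 0, 0, 1, 0, 0, 0, 1]   # fine, semantically weak
c4 = [0, 2, 0, 2]
c5 = [5, 0]                     # coarse, semantically strong
print(top_down_fuse([c3, c4, c5]))
# [[6, 5, 7, 8, 0, 0, 2, 3], [5, 7, 0, 2], [5, 0]]
```

Notice how the strong value 5 from the coarsest map reaches the left half of every finer map, which is the sense in which semantics flow downward.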

Figure 8. BiFPN

Back to Tesla AI Day: how do they recognize the low-resolution car in the image above? They use BiFPN to fuse the multi-scale feature pyramid.

BiFPN is a weighted bi-directional feature pyramid network proposed in the 2019 (v1) paper EfficientDet: Scalable and Efficient Object Detection by the Google Research Brain Team.

BiFPN is an enhanced version of FPN. There are two main improvements:

  1. After the top-down feature fusion, it fuses again from the bottom up.
  2. When fusing features, the BiFPN authors observe that because the input features have different resolutions, they usually contribute unequally to the output feature. So they add a learnable weight to each input.
Figure 9. From BiFPN
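
The weighted fusion in the second improvement can be sketched with the "fast normalized fusion" rule from the EfficientDet paper: each input gets a non-negative scalar weight, and the weights are normalized by their sum plus a small epsilon. The sketch assumes the inputs have already been resized to the same resolution and represents each feature as a flat list; the variable names are illustrative only.

```python
def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fuse same-length feature lists with one learnable scalar per input.

    out = sum_i(w_i * input_i) / (eps + sum_j(w_j)), with w_i clamped to >= 0.
    """
    w = [max(0.0, x) for x in weights]   # ReLU keeps every weight non-negative
    total = eps + sum(w)
    return [sum(wi * f[k] for wi, f in zip(w, inputs)) / total
            for k in range(len(inputs[0]))]


p_td = [1.0, 2.0, 3.0]   # feature coming from the top-down path
p_in = [3.0, 2.0, 1.0]   # feature coming straight from the backbone
fused = fast_normalized_fusion([p_td, p_in], weights=[3.0, 1.0])
print([round(v, 3) for v in fused])  # [1.5, 2.0, 2.5]
```

With weights 3 and 1, the output sits three quarters of the way toward the first input, so training can learn which resolution to trust more.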

In the paper, BiFPN is part of the EfficientDet detector, which uses EfficientNet as its backbone. Tesla AI uses RegNet as the backbone instead.

Figure 10. From BiFPN, edited by the author

Detection Head

Figure 11. Detection Head

After the BiFPN layer, the detection head part of the network is connected. This detection head is composed of a number of task-specific heads.

As shown in Figure 11, to detect cars, the Tesla AI team uses a one-stage, YOLO-like object detector as the head.

This YOLO is not the “you only live once” of the r/wallstreetbets event, the famous short squeeze that played out over a few days in early 2021.

This YOLO refers to “you only look once”, a new approach to object detection, from the paper “You Only Look Once: Unified, Real-Time Object Detection”.

Figure 12. from Paper: You Only Look Once: Unified, Real-Time Object Detection

In this algorithm, you only look once at an image to predict what objects are present (a classification task) and where they are (bounding-box regression plus confidence).

Back to Tesla AI Day: they initialize a raster with a binary bit per position that tells you whether or not there is a car there. If there is, a bunch of other attributes follow, such as the (x, y) coordinate, the width and height of the bounding box, and what type of car it is.

Figure 13. From Tesla AI Day

In Figure 13, cls means classification and reg means bounding-box regression plus confidence. 640x480 is the output resolution, and the 4 in 640x480x4 covers the (x, y) coordinate plus the width and height of the bounding box, 4 outputs in total.
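
Reading such a raster output can be sketched as follows: walk over the classification grid, and wherever a cell says "car", pick up the four regression values stored for that cell. This is a minimal sketch on a tiny 2x3 grid instead of 640x480; the 0.5 threshold, the dictionary layout, and the function name are my assumptions, not Tesla's decoding code.

```python
def decode_raster(cls_grid, reg_grid, threshold=0.5):
    """Collect detections from a per-cell classification and regression grid.

    cls_grid[r][c]: probability that a car is present at cell (r, c).
    reg_grid[r][c]: (x, y, w, h) box attributes predicted for that cell.
    """
    detections = []
    for r, row in enumerate(cls_grid):
        for c, score in enumerate(row):
            if score >= threshold:               # the "is there a car here?" bit
                x, y, w, h = reg_grid[r][c]
                detections.append({"cell": (r, c), "score": score,
                                   "box": (x, y, w, h)})
    return detections


cls_grid = [[0.1, 0.9, 0.2],
            [0.0, 0.3, 0.7]]
reg_grid = [[(0, 0, 0, 0), (1.5, 0.5, 2.0, 1.0), (0, 0, 0, 0)],
            [(0, 0, 0, 0), (0, 0, 0, 0), (2.5, 1.5, 1.0, 1.0)]]
for det in decode_raster(cls_grid, reg_grid):
    print(det)
```

Only the two cells with scores above the threshold produce a box; the regression values of all other cells are ignored.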

HydraNets

In Tesla’s FSD mission, there are many tasks besides detecting cars: for example, traffic light recognition and detection, lane prediction, and so on.

Figure 14. From Tesla AI Day

They combine these tasks in a new architectural layout with a common shared backbone that branches off into a number of heads. This architecture is called HydraNets.

This HydraNet has nothing to do with the Hydra organization in Marvel. Just kidding, forgive me for being a Marvel fan. The Hydra is a serpentine water monster in Greek and Roman mythology that, according to legend, has nine or more heads. The name simply signals that the network has multiple detection heads.

The HydraNets have three main benefits:

  1. Feature Sharing: Reduces repeated convolution calculations and the number of backbones; especially efficient at test time.
  2. De-Couples Tasks: Decouples the specific tasks from the backbone, so each task can be fine-tuned individually.
  3. Representation Bottleneck: Features are cached during training, so a fine-tuning workflow only uses the cached features to fine-tune the heads.
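
The feature-sharing benefit can be illustrated with a small sketch: one backbone call per image, whose output is handed to every task head. The class name, the toy backbone, and the three lambda heads are hypothetical stand-ins, not Tesla's code.

```python
class HydraNetSketch:
    """Shared backbone feeding several task-specific heads (toy stand-in)."""

    def __init__(self, backbone, heads):
        self.backbone = backbone      # shared feature extractor
        self.heads = heads            # dict: task name -> head function
        self.backbone_calls = 0       # counts how often the backbone runs

    def forward(self, image):
        self.backbone_calls += 1
        features = self.backbone(image)   # computed once per image
        return {task: head(features) for task, head in self.heads.items()}


def toy_backbone(image):
    return {"mean": sum(image) / len(image), "peak": max(image)}


net = HydraNetSketch(
    backbone=toy_backbone,
    heads={
        "cars": lambda f: f["peak"] > 0.8,
        "traffic_lights": lambda f: f["mean"] > 0.5,
        "lanes": lambda f: round(f["mean"], 2),
    },
)
outputs = net.forward([0.2, 0.9, 0.7])
print(outputs, "backbone calls:", net.backbone_calls)
```

Three tasks are answered with a single backbone pass; adding a fourth head would not add another pass.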

HydraNet training workflows:

  1. Do an end-to-end training, where they train everything jointly
  2. Cache the features at the multi-scale feature level.
  3. Fine-tune each specific task using the cached features
  4. Train end-to-end once again, and iterate.
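
Steps 2 and 3 of this workflow can be sketched in a few lines: run the backbone once to cache features, then fit a head on the cache without touching the backbone again. Everything here is a toy under stated assumptions: the backbone is a fixed function, the head is linear, and training is plain gradient descent on a squared error. The end-to-end steps 1 and 4 are not shown.

```python
backbone_calls = 0


def backbone(x):
    """Frozen stand-in feature extractor (hypothetical)."""
    global backbone_calls
    backbone_calls += 1
    return (x, x * x)


def fine_tune_head(cached, targets, steps=2000, lr=0.05):
    """Fit a linear head on cached features with plain gradient descent."""
    w = [0.0, 0.0]
    n = len(cached)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for feats, y in zip(cached, targets):
            err = sum(wi * fi for wi, fi in zip(w, feats)) - y
            for i, fi in enumerate(feats):
                grad[i] += 2 * err * fi / n   # gradient of the mean squared error
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]
targets = [2 * x + 3 * x * x for x in inputs]   # toy task for one head

cache = [backbone(x) for x in inputs]           # step 2: cache features once
calls_after_caching = backbone_calls
head_weights = fine_tune_head(cache, targets)   # step 3: backbone is not re-run

print([round(w, 2) for w in head_weights])      # [2.0, 3.0]
print(backbone_calls == calls_after_caching)    # True
```

The backbone runs five times in total, once per sample, no matter how many fine-tuning steps the head takes; that is the saving the cached-feature workflow is after.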

The figure below shows some predictions obtained by processing individual images with a version of HydraNet from a few years ago.

Figure 15. From Tesla AI Day

Above, we explored Tesla AI’s neural network, HydraNet, for monocular object detection.

We know that the single-camera model can only complete simpler assisted driving tasks such as lane-keeping. More complex autonomous driving tasks require the use of multiple cameras as input to the perception system.

How does Tesla solve this problem? I will continue exploring it in the next article.

Takeaway from Tesla AI Day

If you do all of the engineering correctly, even a mission impossible can be solved easily.

Jason Zhang

