The Log: What every software engineer should know about real-time data's unifying abstraction

Jay Kreps
December 16, 2013

I joined LinkedIn about six years ago at a particularly interesting time. We were just beginning to run up against the limits of our monolithic, centralized database and needed to start the transition to a portfolio of specialized distributed systems. This has been an interesting experience: we built, deployed, and run to this day a distributed graph database, a distributed search backend, a Hadoop installation, and a first and second generation key-value store.

One of the most useful things I learned in all this was that many of the things we were building had a very simple concept at their heart: the log. Sometimes called write-ahead logs or commit logs or transaction logs, logs have been around almost as long as computers and are at the heart of many distributed data systems and real-time application architectures.

You can't fully understand databases, NoSQL stores, key-value stores, replication, Paxos, Hadoop, version control, or almost any software system without understanding logs; and yet, most software engineers are not familiar with them. I'd like to change that. In this post, I'll walk you through everything you need to know about logs, including what a log is and how to use logs for data integration, real-time processing, and system building.

Part One: What Is a Log?

A log is perhaps the simplest possible storage abstraction. It is an append-only, totally-ordered sequence of records ordered by time. It looks like this:

Records are appended to the end of the log, and reads proceed left-to-right. Each entry is assigned a unique sequential log entry number.

The ordering of records defines a notion of "time" since entries to the left are defined to be older than entries to the right. The log entry number can be thought of as the "timestamp" of the entry. Describing this ordering as a notion of time seems a bit odd at first, but it has the convenient property that it is decoupled from any particular physical clock. This property will turn out to be essential as we get to distributed systems.
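To make the abstraction concrete, here is a minimal sketch in Python (the Log class and its methods are illustrative, not taken from any particular system):

    class Log:
        """A toy append-only, totally-ordered log."""

        def __init__(self):
            self._records = []

        def append(self, record):
            """Add a record to the end; return its unique sequential entry number."""
            self._records.append(record)
            return len(self._records) - 1   # doubles as the entry's "timestamp"

        def read(self, from_entry=0):
            """Read records left-to-right starting from a given entry number."""
            for entry_number in range(from_entry, len(self._records)):
                yield entry_number, self._records[entry_number]

Note that append assigns "time" without consulting any physical clock; the entry number alone orders the history.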

The contents and format of the records aren't important for the purposes of this discussion. Also, we can't just keep adding records to the log as we'll eventually run out of space. I'll come back to this in a bit.

So, a log is not all that different from a file or a table. A file is an array of bytes, a table is an array of records, and a log is really just a kind of table or file where the records are sorted by time.

At this point you might be wondering why it is worth talking about something so simple. How is an append-only sequence of records in any way related to data systems? The answer is that logs have a specific purpose: they record what happened and when. For distributed data systems this is, in many ways, the very heart of the problem.

But before we get too far let me clarify something that is a bit confusing. Every programmer is familiar with another definition of logging—the unstructured error messages or trace info an application might write out to a local file using syslog or log4j. For clarity I will call this "application logging". The application log is a degenerate form of the log concept I am describing. The biggest difference is that text logs are meant primarily for humans to read, while the "journal" or "data logs" I'm describing are built for programmatic access.

(Actually, if you think about it, the idea of humans reading through logs on individual machines is something of an anachronism. This approach quickly becomes unmanageable when many services and servers are involved, and the purpose of logs quickly becomes serving as an input to queries and graphs that explain behavior across many machines—something for which English text in files is not nearly as appropriate as the kind of structured log described here.)

Logs in databases

I don't know where the log concept originated—probably it is one of those things like binary search that is too simple for the inventor to realize it was an invention. It is present as early as IBM's System R. The usage in databases has to do with keeping the variety of data structures and indexes in sync in the presence of crashes. To make this atomic and durable, a database uses a log to write out information about the records it will be modifying, before applying the changes to all the various data structures it maintains. The log is the record of what happened, and each table or index is a projection of this history into some useful data structure or index. Since the log is immediately persisted, it is used as the authoritative source in restoring all other persistent structures in the event of a crash.
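To illustrate the mechanics, here is a toy write-ahead-logging sketch in Python (a deliberately simplified model, not how any production database implements it): the change is durably logged first, applied second, and replayed on recovery:

    import json, os

    class ToyKVStore:
        """A toy key-value store protected by a write-ahead log."""

        def __init__(self, log_path):
            self.log_path = log_path
            self.table = {}     # the derived structure the log protects
            self._recover()

        def put(self, key, value):
            # 1. Durably record what we are about to do...
            with open(self.log_path, "a") as f:
                f.write(json.dumps({"key": key, "value": value}) + "\n")
                f.flush()
                os.fsync(f.fileno())
            # 2. ...then apply the change to the data structure.
            self.table[key] = value

        def _recover(self):
            # After a crash, the log is the authoritative history: replay it.
            if os.path.exists(self.log_path):
                with open(self.log_path) as f:
                    for line in f:
                        entry = json.loads(line)
                        self.table[entry["key"]] = entry["value"]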

Over time the usage of the log grew from an implementation detail of ACID to a method for replicating data between databases. It turns out that the sequence of changes that happened on the database is exactly what is needed to keep a remote replica database in sync. Oracle, MySQL, and PostgreSQL include log shipping protocols to transmit portions of the log to replica databases which act as slaves. Oracle has productized the log as a general data subscription mechanism for non-Oracle data subscribers with its XStreams and GoldenGate products, and similar facilities in MySQL and PostgreSQL are key components of many data architectures.

Because of this origin, the concept of a machine readable log has largely been confined to database internals. The use of logs as a mechanism for data subscription seems to have arisen almost by chance. But this very abstraction is ideal for supporting all kinds of messaging, data flow, and real-time data processing.

Logs in distributed systems

The two problems a log solves—ordering changes and distributing data—are even more important in distributed data systems. Agreeing upon an ordering for updates (or agreeing to disagree and coping with the side-effects) is among the core design problems for these systems.

The log-centric approach to distributed systems arises from a simple observation that I will call the State Machine Replication Principle:

If two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.

This may seem a bit obtuse, so let's dive in and understand what it means.

Deterministic means that the processing isn't timing dependent and doesn't let any other "out of band" input influence its results. For example, a program whose output is influenced by the particular order of execution of threads, or by a call to gettimeofday, or some other non-repeatable thing is generally best considered as non-deterministic.

The state of the process is whatever data remains on the machine, either in memory or on disk, at the end of the processing.

The bit about getting the same input in the same order should ring a bell—that is where the log comes in. This is a very intuitive notion: if you feed two deterministic pieces of code the same input log, they will produce the same output.

The application to distributed computing is pretty obvious. You can reduce the problem of making multiple machines all do the same thing to the problem of implementing a distributed consistent log to feed these processes input. The purpose of the log here is to squeeze all the non-determinism out of the input stream to ensure that each replica processing this input stays in sync.

When you understand it, there is nothing complicated or deep about this principle: it more or less amounts to saying "deterministic processing is deterministic". Nonetheless, I think it is one of the more general tools for distributed systems design.

One of the beautiful things about this approach is that the time stamps that index the log now act as the clock for the state of the replicas—you can describe each replica by a single number, the timestamp for the maximum log entry it has processed. This timestamp combined with the log uniquely captures the entire state of the replica.

There are a multitude of ways of applying this principle in systems depending on what is put in the log. For example, we can log the incoming requests to a service, or the state changes the service undergoes in response to requests, or the transformation commands it executes. Theoretically, we could even log a series of machine instructions for each replica to execute or the method name and arguments to invoke on each replica. As long as two processes process these inputs in the same way, the processes will remain consistent across replicas.

Different groups of people seem to describe the uses of logs differently. Database people generally differentiate between physical and logical logging. Physical logging means logging the contents of each row that is changed. Logical logging means logging not the changed rows but the SQL commands that lead to the row changes (the insert, update, and delete statements).

The distributed systems literature commonly distinguishes two broad approaches to processing and replication. The "state machine model" usually refers to an active-active model where we keep a log of the incoming requests and each replica processes each request. A slight modification of this, called the "primary-backup model", is to elect one replica as the leader and allow this leader to process requests in the order they arrive and log out the changes to its state from processing the requests. The other replicas apply in order the state changes the leader makes so that they will be in sync and ready to take over as leader should the leader fail.

To understand the difference between these two approaches, let's look at a toy problem. Consider a replicated "arithmetic service" which maintains a single number as its state (initialized to zero) and applies additions and multiplications to this value. The active-active approach might log out the transformations to apply, say "+1", "*2", etc. Each replica would apply these transformations and hence go through the same set of values. The "active-passive" approach would have a single master execute the transformations and log out the result, say "1", "3", "6", etc. This example also makes it clear why ordering is key for ensuring consistency between replicas: reordering an addition and multiplication will yield a different result.
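Here is a sketch of the two styles for this toy service in Python (illustrative code, not any real replication protocol):

    def apply_op(state, op):
        """Apply a transformation like '+1' or '*2' to the current value."""
        sign, amount = op[0], int(op[1:])
        return state + amount if sign == "+" else state * amount

    ops_log = ["+1", "*2", "+4"]

    # Active-active ("state machine model"): every replica applies each
    # logged transformation in order and passes through the same values.
    replica_a = replica_b = 0
    for op in ops_log:
        replica_a = apply_op(replica_a, op)
        replica_b = apply_op(replica_b, op)
    assert replica_a == replica_b == 6

    # Primary-backup: a single leader applies the transformations and logs
    # only the resulting states; backups copy the results in order.
    leader, results_log = 0, []
    for op in ops_log:
        leader = apply_op(leader, op)
        results_log.append(leader)      # logs 1, 2, 6

    backup = 0
    for result in results_log:
        backup = result                 # backups never recompute
    assert backup == leader

Reordering ops_log to ["*2", "+1", "+4"] would leave every replica at 5 rather than 6, which is exactly the ordering sensitivity described above.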

The distributed log can be seen as the data structure which models the problem of consensus. A log, after all, represents a series of decisions on the "next" value to append. You have to squint a little to see a log in the Paxos family of algorithms, though log-building is their most common practical application. With Paxos, this is usually done using an extension of the protocol called "multi-paxos", which models the log as a series of consensus problems, one for each slot in the log. The log is much more prominent in other protocols such as ZAB, RAFT, and Viewstamped Replication, which directly model the problem of maintaining a distributed, consistent log.

My suspicion is that our view of this is a little bit biased by the path of history, perhaps due to the few decades in which the theory of distributed computing outpaced its practical application. In reality, the consensus problem is a bit too simple. Computer systems rarely need to decide a single value, they almost always handle a sequence of requests. So a log, rather than a simple single-value register, is the more natural abstraction.

Furthermore, the focus on the algorithms obscures the underlying log abstraction systems need. I suspect we will end up focusing more on the log as a commoditized building block irrespective of its implementation in the same way we often talk about a hash table without bothering to get into the details of whether we mean murmur hash with linear probing or some other variant. The log will become something of a commoditized interface, with many algorithms and implementations competing to provide the best guarantees and optimal performance.

Changelog 101: Tables and Events are Dual

Let's come back to databases for a bit. There is a fascinating duality between a log of changes and a table. The log is similar to the list of all credits and debits a bank processes; a table is all the current account balances. If you have a log of changes, you can apply these changes in order to create the table capturing the current state. This table will record the latest state for each key (as of a particular log time). There is a sense in which the log is the more fundamental data structure: in addition to creating the original table you can also transform it to create all kinds of derived tables. (And yes, table can mean keyed data store for the non-relational folks.)

This process works in reverse too: if you have a table taking updates, you can record these changes and publish a "changelog" of all the updates to the state of the table. This changelog is exactly what you need to support near-real-time replicas. So in this sense you can see tables and events as dual: tables support data at rest and logs capture change. The magic of the log is that if it is a complete log of changes, it holds not only the contents of the final version of the table, but also allows recreating all other versions that might have existed. It is, effectively, a sort of backup of every previous state of the table.
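A small Python sketch of the duality (toy code): replaying the changelog materializes the table, and replaying a prefix of it reconstructs any earlier version:

    changelog = [
        ("alice", 100),     # each entry: (key, new value)
        ("bob",   50),
        ("alice", 75),
    ]

    def materialize(log, as_of=None):
        """Replay the changelog, optionally up to an entry number, into a table."""
        table = {}
        for key, value in (log if as_of is None else log[:as_of]):
            table[key] = value
        return table

    assert materialize(changelog) == {"alice": 75, "bob": 50}            # latest state
    assert materialize(changelog, as_of=2) == {"alice": 100, "bob": 50}  # older version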

This might remind you of source code version control. There is a close relationship between source control and databases. Version control solves a very similar problem to what distributed data systems have to solve—managing distributed, concurrent changes in state. A version control system usually models the sequence of patches, which is in effect a log. You interact directly with a checked out "snapshot" of the current code which is analogous to the table. You will note that in version control systems, as in other distributed stateful systems, replication happens via the log: when you update, you pull down just the patches and apply them to your current snapshot.

Some people have seen some of these ideas recently from Datomic, a company selling a log-centric database. This presentation gives a great overview of how they have applied the idea in their system. These ideas are not unique to this system, of course, as they have been a part of the distributed systems and database literature for well over a decade.

This may all seem a little theoretical. Do not despair! We'll get to practical stuff pretty quickly.

What's next

In the remainder of this article I will try to give a flavor of what a log is good for that goes beyond the internals of distributed computing or abstract distributed computing models. This includes:

  1. Data Integration—Making all of an organization's data easily available in all its storage and processing systems.
  2. Real-time data processing—Computing derived data streams.
  3. Distributed system design—How practical systems can be simplified with a log-centric design.

These uses all revolve around the idea of a log as a stand-alone service.

In each case, the usefulness of the log comes from the simple function that the log provides: producing a persistent, re-playable record of history. Surprisingly, at the core of these problems is the ability to have many machines play back history at their own rate in a deterministic manner.

Part Two: Data Integration

Let me first say what I mean by "data integration" and why I think it's important, then we'll see how it relates back to logs.

Data integration is making all the data an organization has available in all its services and systems.

This phrase "data integration" isn't all that common, but I don't know a better one. The more recognizable term ETL usually covers only a limited part of data integration—populating a relational data warehouse. But much of what I am describing can be thought of as ETL generalized to cover real-time systems and processing flows.

You don't hear much about data integration in all the breathless interest and hype around the idea of big data, but nonetheless, I believe this mundane problem of "making the data available" is one of the more valuable things an organization can focus on.

Effective use of data follows a kind of Maslow's hierarchy of needs. The base of the pyramid involves capturing all the relevant data, being able to put it together in an applicable processing environment (be that a fancy real-time query system or just text files and python scripts). This data needs to be modeled in a uniform way to make it easy to read and process. Once these basic needs of capturing data in a uniform way are taken care of it is reasonable to work on infrastructure to process this data in various ways—MapReduce, real-time query systems, etc.

It's worth noting the obvious: without a reliable and complete data flow, a Hadoop cluster is little more than a very expensive and difficult to assemble space heater. Once data and processing are available, one can move on to the more refined problems of good data models and consistent, well-understood semantics. Finally, concentration can shift to more sophisticated processing—better visualization, reporting, and algorithmic processing and prediction.

In my experience, most organizations have huge holes in the base of this pyramid—they lack reliable complete data flow—but want to jump directly to advanced data modeling techniques. This is completely backwards.

So the question is, how can we build reliable data flow throughout all the data systems in an organization?

Data Integration: Two complications

Two trends make data integration harder.

The event data firehose

The first trend is the rise of event data. Event data records things that happen rather than things that are. In web systems, this means user activity logging, but also the machine-level events and statistics required to reliably operate and monitor a data center's worth of machines. People tend to call this "log data" since it is often written to application logs, but that confuses form with function. This data is at the heart of the modern web: Google's fortune, after all, is generated by a relevance pipeline built on clicks and impressions—that is, events.

And this stuff isn't limited to web companies, it's just that web companies are already fully digital, so they are easier to instrument. Financial data has long been event-centric. RFID adds this kind of tracking to physical objects. I think this trend will continue with the digitization of traditional businesses and activities.

This type of event data records what happened, and tends to be several orders of magnitude larger than traditional database uses. This presents significant challenges for processing.

The explosion of specialized data systems

The second trend comes from the explosion of specialized data systems that have become popular and often freely available in the last five years. Specialized systems exist for OLAP, search, simple online storage, batch processing, graph analysis, and so on.

The combination of more data of more varieties and a desire to get this data into more systems leads to a huge data integration problem.

Log-structured data flow

The log is the natural data structure for handling data flow between systems. The recipe is very simple:
Take all the organization's data and put it into a central log for real-time subscription.

Each logical data source can be modeled as its own log. A data source could be an application that logs out events (say clicks or page views), or a database table that accepts modifications. Each subscribing system reads from this log as quickly as it can, applies each new record to its own store, and advances its position in the log. Subscribers could be any kind of data system—a cache, Hadoop, another database in another site, a search system, etc.
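As a sketch of these mechanics (reusing the toy Log from Part One; the store interface is a hypothetical stand-in for a cache, search index, or warehouse loader):

    class Subscriber:
        """A consumer that applies log records to its own store at its own pace."""

        def __init__(self, store):
            self.store = store
            self.position = 0       # the "point in time" this system has read up to

        def poll(self, log):
            for entry_number, record in log.read(from_entry=self.position):
                self.store.apply(record)          # apply to the local store
                self.position = entry_number + 1  # advance only after applying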

For example, the log concept gives a logical clock for each change against which all subscribers can be measured. This makes reasoning about the state of the different subscriber systems with respect to one another far simpler, as each has a "point in time" they have read up to.

To make this more concrete, consider a simple case where there is a database and a collection of caching servers. The log provides a way to synchronize the updates to all these systems and reason about the point of time of each of these systems. Let's say we write a record with log entry X and then need to do a read from the cache. If we want to guarantee we don't see stale data, we just need to ensure we don't read from any cache which has not replicated up to X.
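Continuing the sketch above, that guarantee reduces to a comparison of log positions (illustrative code, not a real API):

    def read_fresh(caches, key, min_position):
        """Serve the read only from a cache that has replicated up to entry X."""
        for cache in caches:
            if cache.position >= min_position:
                return cache.store.get(key)
        return None   # no cache is fresh enough; fall back to the database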

The log also acts as a buffer that makes data production asynchronous from data consumption. This is important for a lot of reasons, but particularly when there are multiple subscribers that may consume at different rates. This means a subscribing system can crash or go down for maintenance and catch up when it comes back: the subscriber consumes at a pace it controls. A batch system such as Hadoop or a data warehouse may consume only hourly or daily, whereas a real-time query system may need to be up-to-the-second. Neither the originating data source nor the log has knowledge of the various data destination systems, so consumer systems can be added and removed with no change in the pipeline.

Of particular importance: the destination system only knows about the log and not any details of the system of origin. The consumer system need not concern itself with whether the data came from an RDBMS, a new-fangled key-value store, or was generated without a real-time query system of any kind. This seems like a minor point, but is in fact critical.

I use the term "log" here instead of "messaging system" or "pub sub" because it is a lot more specific about semantics and a much closer description of what you need in a practical implementation to support data replication. I have found that "publish subscribe" doesn't imply much more than indirect addressing of messages—if you compare any two messaging systems promising publish-subscribe, you find that they guarantee very different things, and most models are not useful in this domain. You can think of the log as acting as a kind of messaging system with durability guarantees and strong ordering semantics. In distributed systems, this model of communication sometimes goes by the (somewhat terrible) name of atomic broadcast.

It's worth emphasizing that the log is still just the infrastructure. That isn't the end of the story of mastering data flow: the rest of the story is around metadata, schemas, compatibility, and all the details of handling data structure and evolution. But until there is a reliable, general way of handling the mechanics of data flow, the semantic details are secondary.

At LinkedIn

I got to watch this data integration problem emerge in fast-forward as LinkedIn moved from a centralized relational database to a collection of distributed systems.

These days our major data systems include:

Each of these is a specialized distributed system that provides advanced functionality in its area of specialty.

This idea of using logs for data flow has been floating around LinkedIn since even before I got here. One of the earliest pieces of infrastructure we developed was a service called databus that provided a log caching abstraction on top of our early Oracle tables to scale subscription to database changes so we could feed our social graph and search indexes.

I'll give a little bit of the history to provide context. My own involvement in this started around 2008 after we had shipped our key-value store. My next project was to try to get a working Hadoop setup going, and move some of our recommendation processes there. Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy prediction algorithms. So began a long slog.

We originally planned to just scrape the data out of our existing Oracle data warehouse. The first discovery was that getting data out of Oracle quickly is something of a dark art. Worse, the data warehouse processing was not appropriate for the production batch processing we planned for Hadoop—much of the processing was non-reversible and specific to the reporting being done. We ended up avoiding the data warehouse and going directly to source databases and log files. Finally, we implemented another pipeline to load data into our key-value store for serving results.

This mundane data copying ended up being one of the dominant items for the original development. Worse, any time there was a problem in any of the pipelines, the Hadoop system was largely useless—running fancy algorithms on bad data just produces more bad data.

Although we had built things in a fairly generic way, each new data source required custom configuration to set up. It also proved to be the source of a huge number of errors and failures. The site features we had implemented on Hadoop became popular and we found ourselves with a long list of interested engineers. Each user had a list of systems they wanted integration with and a long list of new data feeds they wanted.

ETL in Ancient Greece. Not much has changed.

A few things slowly became clear to me.

First, the pipelines we had built, though a bit of a mess, were actually extremely valuable. Just the process of making data available in a new processing system (Hadoop) unlocked a lot of possibilities. New computation was possible on the data that would have been hard to do before. Many new products and analysis just came from putting together multiple pieces of data that had previously been locked up in specialized systems.

Second, it was clear that reliable data loads would require deep support from the data pipeline. If we captured all the structure we needed, we could make Hadoop data loads fully automatic, so that no manual effort was expended in adding new data sources or handling schema changes—data would just magically appear in HDFS, and Hive tables would automatically be generated for new data sources with the appropriate columns.

Third, we still had very low data coverage. That is, if you looked at the overall percentage of the data LinkedIn had that was available in Hadoop, it was still very incomplete. And getting to completion was not going to be easy given the amount of effort required to operationalize each new data source.

The way we had been proceeding, building out custom data loads for each data source and destination, was clearly infeasible. We had dozens of data systems and data repositories. Connecting all of these would have led to building custom piping between each pair of systems, something like this:

Note that data often flows in both directions, as many systems (databases, Hadoop) are both sources and destinations for data transfer. This meant we would end up building two pipelines per system: one to get data in and one to get data out.

This clearly would take an army of people to build and would never be operable. As we approached full connectivity we would end up with something like O(N²) pipelines.

Instead, we needed something generic like this:

As much as possible, we needed to isolate each consumer from the source of the data. They should ideally integrate with just a single data repository that would give them access to everything.

The idea is that adding a new data system—be it a data source or a data destination—should create integration work only to connect it to a single pipeline instead of each consumer of data.

This experience led me to focus on building Kafka to combine what we had seen in messaging systems with the log concept popular in databases and distributed system internals. We wanted something to act as a central pipeline first for all activity data, and eventually for many other uses, including data deployment out of Hadoop, monitoring data, etc.

For a long time, Kafka was a little unique (some would say odd) as an infrastructure product—neither a database nor a log file collection system nor a traditional messaging system. But recently Amazon has offered a service that is very, very similar to Kafka called Kinesis. The similarity goes right down to the way partitioning is handled, data is retained, and the fairly odd split in the Kafka API between high- and low-level consumers. I was pretty happy about this. A sign you've created a good infrastructure abstraction is that AWS offers it as a service! Their vision for this seems to be exactly what I am describing: it is the piping that connects all their distributed systems—DynamoDB, RedShift, S3, etc.—as well as the basis for distributed stream processing using EC2.

Relationship to ETL and the Data Warehouse

Let's talk data warehousing for a bit. The data warehouse is meant to be a repository of the clean, integrated data structured to support analysis. This is a great idea. For those not in the know, the data warehousing methodology involves periodically extracting data from source databases, munging it into some kind of understandable form, and loading it into a central data warehouse. Having this central location that contains a clean copy of all your data is a hugely valuable asset for data-intensive analysis and processing. At a high level, this methodology doesn't change too much whether you use a traditional data warehouse like Oracle or Teradata or Hadoop, though you might switch up the order of loading and munging.

A data warehouse containing clean, integrated data is a phenomenal asset, but the mechanics of getting this are a bit out of date.

The key problem for a data-centric organization is coupling the clean integrated data to the data warehouse. A data warehouse is a piece of batch query infrastructure which is well suited to many kinds of reporting and ad hoc analysis, particularly when the queries involve simple counting, aggregation, and filtering. But having a batch system be the only repository of clean complete data means the data is unavailable for systems requiring a real-time feed—real-time processing, search indexing, monitoring systems, etc.

In my view, ETL is really two things. First, it is an extraction and data cleanup process—essentially liberating data locked up in a variety of systems in the organization and removing any system-specific nonsense. Secondly, that data is restructured for data warehousing queries (i.e. made to fit the type system of a relational DB, forced into a star or snowflake schema, perhaps broken up into a high performance column format, etc). Conflating these two things is a problem. The clean, integrated repository of data should also be available in real time for low-latency processing, and for indexing in other real-time storage systems.

I think this has the added benefit of making data warehousing ETL much more organizationally scalable. The classic problem of the data warehouse team is that they are responsible for collecting and cleaning all the data generated by every other team in the organization. The incentives are not aligned: data producers are often not very aware of the use of the data in the data warehouse and end up creating data that is hard to extract or requires heavy, hard to scale transformation to get into usable form. Of course, the central team never quite manages to scale to match the pace of the rest of the organization, so data coverage is always spotty, data flow is fragile, and changes are slow.

A better approach is to have a central pipeline, the log, with a well defined API for adding data. The responsibility of integrating with this pipeline and providing a clean, well-structured data feed lies with the producer of this data feed. This means that as part of their system design and implementation they must consider the problem of getting data out and into a well structured form for delivery to the central pipeline. The addition of new storage systems is of no consequence to the data warehouse team as they have a central point of integration. The data warehouse team handles only the simpler problem of loading structured feeds of data from the central log and carrying out transformation specific to their system.

This point about organizational scalability becomes particularly important when one considers adopting additional data systems beyond a traditional data warehouse. Say, for example, that one wishes to provide search capabilities over the complete data set of the organization. Or, say that one wants to provide sub-second monitoring of data streams with real-time trend graphs and alerting. In either of these cases, the infrastructure of the traditional data warehouse or even a Hadoop cluster is going to be inappropriate. Worse, the ETL processing pipeline built to support database loads is likely of no use for feeding these other systems, making bootstrapping these pieces of infrastructure as large an undertaking as adopting a data warehouse. This likely isn't feasible and probably helps explain why most organizations do not have these capabilities easily available for all their data. By contrast, if the organization had built out feeds of uniform, well-structured data, getting any new system full access to all data requires only a single bit of integration plumbing to attach to the pipeline.

This architecture also raises a set of different options for where a particular cleanup or transformation can reside:

  1. It can be done by the data producer prior to adding the data to the company wide log.
  2. It can be done as a real-time transformation on the log (which in turn produces a new, transformed log)
  3. It can be done as part of the load process into some destination data system

The best model is to have cleanup done prior to publishing the data to the log by the publisher of the data. This means ensuring the data is in a canonical form and doesn't retain any hold-overs from the particular code that produced it or the storage system in which it may have been maintained. These details are best handled by the team that creates the data since they know the most about their own data. Any logic applied in this stage should be lossless and reversible.

Any kind of value-added transformation that can be done in real-time should be done as post-processing on the raw log feed produced. This would include things like sessionization of event data, or the addition of other derived fields that are of general interest. The original log is still available, but this real-time processing produces a derived log containing augmented data.
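A sketch of this second option (toy code reusing the Log from Part One; the enrich function is a stand-in for, say, sessionization): a transformer subscribes to the raw log and appends enriched records to a new, derived log, leaving the original untouched:

    import time

    def derive(raw_log, derived_log, enrich):
        """Continuously transform a raw event log into a derived log."""
        position = 0
        while True:
            caught_up = True
            for entry_number, record in raw_log.read(from_entry=position):
                derived_log.append(enrich(record))   # e.g. attach a session id
                position = entry_number + 1
                caught_up = False
            if caught_up:
                time.sleep(1)   # at the end of the log; poll again shortly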

Finally, only aggregation that is specific to the destination system should be performed as part of the loading process. This might include transforming data into a particular star or snowflake schema for analysis and reporting in a data warehouse. Because this stage, which most naturally maps to the traditional ETL process, is now done on a far cleaner and more uniform set of streams, it should be much simplified.

Log Files and Events

Let's talk a little bit about a side benefit of this architecture: it enables decoupled, event-driven systems.

The typical approach to activity data in the web industry is to log it out to text files where it can be scraped into a data warehouse or into Hadoop for aggregation and querying. The problem with this is the same as the problem with all batch ETL: it couples the data flow to the data warehouse's capabilities and processing schedule.

At LinkedIn, we have built our event data handling in a log-centric fashion. We are using Kafka as the central, multi-subscriber event log. We have defined several hundred event types, each capturing the unique attributes about a particular type of action. This covers everything from page views, ad impressions, and searches, to service invocations and application exceptions.

To understand the advantages of this, imagine a simple event—showing a job posting on the job page. The job page should contain only the logic required to display the job. However, in a fairly dynamic site, this could easily become larded up with additional logic unrelated to showing the job. For example let's say we need to integrate the following systems:

  1. We need to send this data to Hadoop and the data warehouse for offline processing purposes
  2. We need to count the view to ensure that the viewer is not attempting some kind of content scraping
  3. We need to aggregate this view for display in the job poster's analytics page
  4. We need to record the view to ensure we properly impression cap any job recommendations for that user (we don't want to show the same thing over and over)
  5. Our recommendation system may need to record the view to correctly track the popularity of that job
  6. Etc

Pretty soon, the simple act of displaying a job has become quite complex. And as we add other places where jobs are displayed—mobile applications, and so on—this logic must be carried over and the complexity increases. Worse, the systems that we need to interface with are now somewhat intertwined—the person working on displaying jobs needs to know about many other systems and features and make sure they are integrated properly. This is just a toy version of the problem, any real application would be more, not less, complex.

The "event-driven" style provides an approach to simplifying this. The job display page now just shows a job and records the fact that a job was shown along with the relevant attributes of the job, the viewer, and any other useful facts about the display of the job. Each of the other interested systems—the recommendation system, the security system, the job poster analytics system, and the data warehouse—all just subscribe to the feed and do their processing. The display code need not be aware of these other systems, and needn't be changed if a new data consumer is added.
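A sketch of that decoupling (hypothetical names throughout): the display code publishes a single event, and every interested system subscribes independently:

    def render(job):
        ...   # stand-in for the actual page-rendering logic

    def show_job(job, viewer, event_log):
        render(job)                     # the only display logic that remains
        event_log.append({              # record the fact of the display, once
            "event": "job_view",
            "job_id": job.id,
            "viewer_id": viewer.id,
        })

    # Entirely separate consumers each subscribe to the "job_view" feed:
    #   - the security system counts views per viewer to detect scraping
    #   - the analytics system aggregates views per job poster
    #   - the recommendation system updates popularity and impression caps
    # Adding a new consumer requires no change to show_job.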

Building a Scalable Log

Of course, separating publishers from subscribers is nothing new. But if you want to keep a commit log that acts as a multi-subscriber real-time journal of everything happening on a consumer-scale website, scalability will be a primary challenge. Using a log as a universal integration mechanism is never going to be more than an elegant fantasy if we can't build a log that is fast, cheap, and scalable enough to make this practical at scale.

Systems people typically think of a distributed log as a slow, heavy-weight abstraction (and usually associate it only with the kind of "metadata" uses for which Zookeeper might be appropriate). But with a thoughtful implementation focused on journaling large data streams, this need not be true. At LinkedIn we are currently running over 60 billion unique message writes through Kafka per day (several hundred billion if you count the writes from mirroring between datacenters).

We used a few tricks in Kafka to support this kind of scale:

  1. Partitioning the log
  2. Optimizing throughput by batching reads and writes
  3. Avoiding needless data copies

In order to allow horizontal scaling we chop up our log into partitions:

Each partition is a totally ordered log, but there is no global ordering between partitions (other than perhaps some wall-clock time you might include in your messages). The assignment of the messages to a particular partition is controllable by the writer, with most users choosing to partition by some kind of key (e.g. user id). Partitioning allows log appends to occur without co-ordination between shards and allows the throughput of the system to scale linearly with the Kafka cluster size.
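The writer-controlled assignment is typically just a hash of the chosen key, as in this generic sketch (Kafka's actual partitioners differ in detail):

    import hashlib

    def partition_for(key, num_partitions):
        """Route all messages with the same key to the same partition, which
        preserves per-key order without cross-partition coordination."""
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions

    assert partition_for("user-42", 8) == partition_for("user-42", 8)   # stable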

Each partition is replicated across a configurable number of replicas, each of which has an identical copy of the partition's log. At any time, a single one of them will act as the leader; if the leader fails, one of the replicas will take over as leader.

Lack of a global order across partitions is a limitation, but we have not found it to be a major one. Indeed, interaction with the log typically comes from hundreds or thousands of distinct processes so it is not meaningful to talk about a total order over their behavior. Instead, the guarantees that we provide are that each partition is order preserving, and Kafka guarantees that appends to a particular partition from a single sender will be delivered in the order they are sent.

A log, like a filesystem, is easy to optimize for linear read and write patterns. The log can group small reads and writes together into larger, high-throughput operations. Kafka pursues this optimization aggressively. Batching occurs from client to server when sending data, in writes to disk, in replication between servers, in data transfer to consumers, and in acknowledging committed data.
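In sketch form, the batching idea looks like this (illustrative code, not Kafka's producer API; append_batch is a hypothetical bulk write):

    import time

    class BatchingWriter:
        """Group many small writes into fewer large, high-throughput appends."""

        def __init__(self, log, max_batch=1000, max_delay_secs=0.05):
            self.log, self.buffer = log, []
            self.max_batch, self.max_delay = max_batch, max_delay_secs
            self.oldest = None

        def write(self, record):
            self.buffer.append(record)
            self.oldest = self.oldest or time.time()
            if (len(self.buffer) >= self.max_batch
                    or time.time() - self.oldest >= self.max_delay):
                self.flush()

        def flush(self):
            if self.buffer:
                self.log.append_batch(self.buffer)   # one large sequential write
                self.buffer, self.oldest = [], None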

Finally, Kafka uses a simple binary format that is maintained between in-memory log, on-disk log, and in network data transfers. This allows us to make use of numerous optimizations including zero-copy data transfer.

The cumulative effect of these optimizations is that you can usually write and read data at the rate supported by the disk or network, even while maintaining data sets that vastly exceed memory.

This write-up isn't meant to be primarily about Kafka so I won't go into further details. You can read a more detailed overview of LinkedIn's approach here and a thorough overview of Kafka's design here.

Part Three: Logs & Real-time Stream Processing

So far, I have only described what amounts to a fancy method of copying data from place-to-place. But shlepping bytes between storage systems is not the end of the story. It turns out that "log" is another word for "stream" and logs are at the heart of stream processing.

But, wait, what exactly is stream processing?

If you are a fan of late 90s and early 2000s database literature or semi-successful data infrastructure products, you likely associate stream processing with efforts to build a SQL engine or "boxes and arrows" interface for event driven processing.

If you follow the explosion of open source data systems, you likely associate stream processing with some of the systems in this space—for example, Storm, Akka, S4, and Samza. But most people see these as a kind of asynchronous message processing system not that different from a cluster-aware RPC layer (and in fact some things in this space are exactly that).

Both these views are a little limited. Stream processing has nothing to do with SQL. Nor is it limited to real-time processing. There is no inherent reason you can't process the stream of data from yesterday or a month ago using a variety of different languages to express the computation.

I see stream processing as something much broader: infrastructure for continuous data processing. I think the computational model can be as general as MapReduce or other distributed processing frameworks, but with the ability to produce low-latency results.

The real driver for the processing model is the method of data collection. Data which is collected in batch is naturally processed in batch. When data is collected continuously, it is naturally processed continuously.

The US census provides a good example of batch data collection. The census periodically kicks off and does a brute force discovery and enumeration of US citizens by having people walk around door-to-door. This made a lot of sense in 1790 when the census was first begun. Data collection at the time was inherently batch oriented: it involved riding around on horseback and writing down records on paper, then transporting this batch of records to a central location where humans added up all the counts. These days, when you describe the census process one immediately wonders why we don't keep a journal of births and deaths and produce population counts either continuously or with whatever granularity is needed.

This is an extreme example, but many data transfer processes still depend on taking periodic dumps and bulk transfer and integration. The only natural way to process a bulk dump is with a batch process. But as these processes are replaced with continuous feeds, one naturally starts to move towards continuous processing to smooth out the processing resources needed and reduce latency.

LinkedIn, for example, has almost no batch data collection at all. The majority of our data is either activity data or database changes, both of which occur continuously. In fact, when you think about any business, the underlying mechanics are almost always a continuous process—events happen in real-time, as Jack Bauer would tell us. When data is collected in batches, it is almost always due to some manual step or lack of digitization or is a historical relic left over from the automation of some non-digital process. Transmitting and reacting to data used to be very slow when the mechanics were mail and humans did the processing. A first pass at automation always retains the form of the original process, so this often lingers for a long time.

Production "batch" processing jobs that run daily are often effectively mimicking a kind of continuous computation with a window size of one day. The underlying data is, of course, always changing. These were actually so common at LinkedIn (and the mechanics of making them work in Hadoop so tricky) that we implemented a whole framework for managing incremental Hadoop workflows.

Seen in this light, it is easy to have a different view of stream processing: it is just processing which includes a notion of time in the underlying data being processed and does not require a static snapshot of the data so it can produce output at a user-controlled frequency instead of waiting for the "end" of the data set to be reached. In this sense, stream processing is a generalization of batch processing, and, given the prevalence of real-time data, a very important generalization.

So why has the traditional view of stream processing been as a niche application? I think the biggest reason is that a lack of real-time data collection made continuous processing something of an academic concern.

I think the lack of real-time data collection is likely what doomed the commercial stream-processing systems. Their customers were still doing file-oriented, daily batch processing for ETL and data integration. Companies building stream processing systems focused on providing processing engines to attach to real-time data streams, but it turned out that at the time very few people actually had real-time data streams. Actually, very early in my career at LinkedIn, a company tried to sell us a very cool stream processing system, but since all our data was collected in hourly files at that time, the best application we could come up with was to pipe the hourly files into the stream system at the end of the hour! They noted that this was a fairly common problem. The exception actually proves the rule here: finance, the one domain where stream processing has met with some success, was exactly the area where real-time data streams were already the norm and processing had become the bottleneck.

Even in the presence of a healthy batch processing ecosystem, I think the actual applicability of stream processing as an infrastructure style is quite broad. I think it covers the gap in infrastructure between real-time request/response services and offline batch processing. For modern internet companies, I think around 25% of their code falls into this category.

It turns out that the log solves some of the most critical technical problems in stream processing, which I'll describe, but the biggest problem that it solves is just making data available in real-time multi-subscriber data feeds. For those interested in more details, we have open sourced Samza, a stream processing system explicitly built on many of these ideas. We describe a lot of these applications in more detail in the documentation here.

Data flow graphs

The most interesting aspect of stream processing has nothing to do with the internals of a stream processing system, but instead has to do with how it extends our idea of what a data feed is from the earlier data integration discussion. We discussed primarily feeds or logs of primary data—the events and rows of data produced in the execution of various applications. But stream processing allows us to also include feeds computed off other feeds. These derived feeds look no different to consumers than the feeds of primary data from which they are computed. These derived feeds can encapsulate arbitrary complexity.

Let's dive into this a bit. A stream processing job, for our purposes, will be anything that reads from logs and writes output to logs or other systems. The logs they use for input and output join these processes into a graph of processing stages. Indeed, using a centralized log in this fashion, you can view all the organization's data capture, transformation, and flow as just a series of logs and processes that write to them.
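
To make this concrete, a stream processing job in this sense can be as small as the sketch below: consume one log, transform each record, and append the result to another log. The sketch uses the modern Kafka client APIs (which postdate this post); the topic names and the transformation are placeholders.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // A one-stage processing graph: "page-views" -> transform -> "page-views-enriched".
    public class SimpleStreamJob {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "simple-stream-job");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> in = new KafkaConsumer<>(props);
                 KafkaProducer<String, String> out = new KafkaProducer<>(props)) {
                in.subscribe(List.of("page-views"));
                while (true) {
                    for (ConsumerRecord<String, String> rec : in.poll(Duration.ofMillis(100))) {
                        String transformed = rec.value().trim(); // stand-in for real logic
                        out.send(new ProducerRecord<>("page-views-enriched", rec.key(), transformed));
                    }
                }
            }
        }
    }

Chaining such jobs, with one job's output topic becoming another's input, is exactly what turns individual processors into a graph of processing stages.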

A stream processor need not have a fancy framework at all: it can be any process or set of processes that read and write from logs, but additional infrastructure and support can be provided for helping manage processing code.

The purpose of the log in the integration is two-fold.

First, it makes each dataset multi-subscriber and ordered. Recall our "state replication" principle to remember the importance of order. To make this more concrete, consider a stream of updates from a database—if we re-order two updates to the same record in our processing we may produce the wrong final output. This order is more permanent than what is provided by something like TCP as it is not limited to a single point-to-point link and survives beyond process failures and reconnections.

Second, the log provides buffering to the processes. This is very fundamental. If processing proceeds in an unsynchronized fashion it is likely to happen that an upstream data producing job will produce data more quickly than another downstream job can consume it. When this occurs processing must block, buffer or drop data. Dropping data is likely not an option; blocking may cause the entire processing graph to grind to a halt. The log acts as a very, very large buffer that allows a process to be restarted or fail without slowing down other parts of the processing graph. This isolation is particularly important when extending this data flow to a larger organization, where processing is happening by jobs made by many different teams. We cannot have one faulty job cause back-pressure that stops the entire processing flow.

Both Storm and Samza are built in this fashion and can use Kafka or other similar systems as their log.

Stateful Real-Time Processing

Some real-time stream processing is just stateless record-at-a-time transformation, but many of the uses are more sophisticated counts, aggregations, or joins over windows in the stream. One might, for example, want to enrich an event stream (say a stream of clicks) with information about the user doing the click—in effect joining the click stream to the user account database. Invariably, this kind of processing ends up requiring some kind of state to be maintained by the processor: for example, when computing a count, you have the count so far to maintain. How can this kind of state be maintained correctly if the processors themselves can fail?

The simplest alternative would be to keep state in memory. However, if the process crashed, it would lose its intermediate state. If state is only maintained over a window, the process could just fall back to the point in the log where the window began. However, if one is doing a count over an hour, this may not be feasible.
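
Here is a sketch of that fallback, assuming a Kafka-style consumer that can seek to an arbitrary offset: the processor keeps its window state only in memory, remembers the log offset at which the current window began, and after a crash simply seeks back and recounts. The names are hypothetical.

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    // In-memory windowed count that recovers by replaying the window from the log.
    public class WindowedCounter {
        private final Map<String, Long> counts = new HashMap<>();
        private long windowStartOffset; // would be stored durably in practice

        public void run(KafkaConsumer<String, String> consumer, TopicPartition clicks) {
            consumer.assign(List.of(clicks));
            consumer.seek(clicks, windowStartOffset); // after a crash: replay the window
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(100))) {
                    counts.merge(rec.key(), 1L, Long::sum);
                    // At a window boundary: emit the counts, clear the map, and
                    // advance windowStartOffset to rec.offset() + 1.
                }
            }
        }
    }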

An alternative is to simply store all state in a remote storage system and join over the network to that store. The problem with this is that there is no locality of data and lots of network round-trips.

How can we support something like a "table" that is partitioned up with our processing?

Well, recall the discussion of the duality of tables and logs. This gives us exactly the tool to be able to convert streams to tables co-located with our processing, as well as a mechanism for handling fault tolerance for these tables.

A stream processor can keep its state in a local "table" or "index"—a bdb, leveldb, or even something more unusual such as a Lucene or fastbit index. The contents of this store are fed from its input streams (after first perhaps applying arbitrary transformation). It can journal out a changelog for this local index it keeps to allow it to restore its state in the event of a crash and restart. This gives a generic mechanism for keeping co-partitioned state in arbitrary index types local with the incoming stream data.

When the process fails, it restores its index from the changelog. The log is the transformation of the local state into a sort of incremental record-at-a-time backup.
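
A minimal sketch of this pattern, with a hypothetical Changelog interface standing in for a Kafka partition: every update is journaled to the changelog before it is applied to the local table, so a restarted process can rebuild the table by replaying the changelog from the beginning.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.BiConsumer;

    // Stand-in for a changelog partition: append-only, replayable from offset 0.
    interface Changelog {
        void append(String key, String value);
        void replay(BiConsumer<String, String> handler);
    }

    // A local "table" whose contents are entirely recoverable from the changelog.
    class ChangelogBackedStore {
        private final Map<String, String> table = new HashMap<>();
        private final Changelog changelog;

        ChangelogBackedStore(Changelog changelog) {
            this.changelog = changelog;
            changelog.replay(table::put); // crash recovery: rebuild the local index
        }

        void put(String key, String value) {
            changelog.append(key, value); // journal first...
            table.put(key, value);        // ...then apply to the local index
        }

        String get(String key) {
            return table.get(key);
        }
    }

In a real system the local store would be an embedded index like leveldb rather than a hash map, but the journal-then-apply discipline is the same.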

This approach to state management has the elegant property that the state of the processors is also maintained as a log. We can think of this log just like we would the log of changes to a database table. In fact, the processors have something very like a co-partitioned table maintained along with them. Since this state is itself a log, other processors can subscribe to it. This can actually be quite useful in cases when the goal of the processing is to update a final state and this state is the natural output of the processing.

When combined with the logs coming out of databases for data integration purposes, the power of the log/table duality becomes clear. A change log may be extracted from a database and indexed in different forms by various stream processors to join against event streams.

We give more detail on this style of managing stateful processing in Samza and a lot more practical examples here.

Log Compaction

Of course, we can't hope to keep a complete log for all state changes for all time. Unless one wants to use infinite space, somehow the log must be cleaned up. I'll talk a little about the implementation of this in Kafka to make it more concrete. In Kafka, cleanup has two options depending on whether the data contains keyed updates or event data. For event data, Kafka supports just retaining a window of data. Usually, this is configured to a few days, but the window can be defined in terms of time or space. For keyed data, though, a nice property of the complete log is that you can replay it to recreate the state of the source system (potentially recreating it in another system).
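
In Kafka these two retention policies are per-topic settings. A sketch of configuring one topic of each kind using the modern admin client (which postdates this post); the topic names, partition counts, and replication factors are placeholders:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewTopic;

    public class RetentionExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            try (Admin admin = Admin.create(props)) {
                // Event data: keep a time window (here, 7 days), then discard.
                NewTopic events = new NewTopic("page-views", 8, (short) 3)
                        .configs(Map.of("cleanup.policy", "delete",
                                        "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
                // Keyed data: compact, retaining the latest record per key.
                NewTopic profiles = new NewTopic("user-profiles", 8, (short) 3)
                        .configs(Map.of("cleanup.policy", "compact"));
                admin.createTopics(List.of(events, profiles)).all().get();
            }
        }
    }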

However, retaining the complete log will use more and more space as time goes by, and the replay will take longer and longer. Hence, in Kafka, we support a different type of retention. Instead of simply throwing away the old log, we remove obsolete records—i.e. records whose primary key has a more recent update. By doing this, we still guarantee that the log contains a complete backup of the source system, but now we can no longer recreate all previous states of the source system, only the more recent ones. We call this feature log compaction.
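
A toy version of the compaction pass makes the retention rule concrete: scan the log, remember the last offset seen for each key, and keep only those records. (A real compactor, like Kafka's, works segment by segment; this sketch only illustrates the invariant.)

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class Compactor {
        record Entry(String key, String value) {}

        // Keep only the most recent entry for each key, preserving log order.
        static List<Entry> compact(List<Entry> log) {
            Map<String, Integer> lastOffsetForKey = new HashMap<>();
            for (int i = 0; i < log.size(); i++) {
                lastOffsetForKey.put(log.get(i).key(), i);
            }
            List<Entry> compacted = new ArrayList<>();
            for (int i = 0; i < log.size(); i++) {
                if (lastOffsetForKey.get(log.get(i).key()) == i) {
                    compacted.add(log.get(i));
                }
            }
            return compacted;
        }
    }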

Part Four: System Building

The final topic I want to discuss is the role of the log in data system design for online data systems.

There is an analogy here between the role a log serves for data flow inside a distributed database and the role it serves for data integration in a larger organization. In both cases, it is responsible for data flow, consistency, and recovery. What, after all, is an organization, if not a very complicated distributed data system?

Unbundling?

So maybe if you squint a bit, you can see the whole of your organization's systems and data flows as a single distributed database. You can view all the individual query-oriented systems (Redis, SOLR, Hive tables, and so on) as just particular indexes on your data. You can view the stream processing systems like Storm or Samza as just a very well-developed trigger and view materialization mechanism. Classical database people, I have noticed, like this view very much because it finally explains to them what on earth people are doing with all these different data systems—they are just different index types!

There is undeniably now an explosion of types of data systems, but in reality, this complexity has always existed. Even in the heyday of the relational database, organizations had lots and lots of relational databases! So perhaps real integration hasn't existed since the mainframe when all the data really was in one place. There are many motivations for segregating data into multiple systems: scale, geography, security, and performance isolation are the most common. But these issues can be addressed by a good system: it is possible for an organization to have a single Hadoop cluster, for example, that contains all the data and serves a large and diverse constituency.

So there is already one possible simplification in the handling of data that has become possible in the move to distributed systems: coalescing lots of little instances of each system into a few big clusters. Many systems aren't good enough to allow this yet: they don't have security, or can't guarantee performance isolation, or just don't scale well enough. But each of these problems is solvable.

My take is that the explosion of different systems is caused by the difficulty of building distributed data systems. By cutting back to a single query type or use case each system is able to bring its scope down into the set of things that are feasible to build. But running all these systems yields too much complexity.

I see three possible directions this could follow in the future.

The first possibility is a continuation of the status quo: the separation of systems remains more or less as it is for a good deal longer. This could happen either because the difficulty of distribution is too hard to overcome or because this specialization allows new levels of convenience and power for each system. As long as this remains true, the data integration problem will remain one of the most centrally important things for the successful use of data. In this case, an external log that integrates data will be very important.

The second possibility is that there could be a re-consolidation in which a single system with enough generality starts to merge all the different functions back into a single uber-system. This uber-system could be like the relational database superficially, but its use in an organization would be far different as you would need only one big one instead of umpteen little ones. In this world, there is no real data integration problem except what is solved inside this system. I think the practical difficulties of building such a system make this unlikely.

There is another possible outcome, though, which I actually find appealing as an engineer. One interesting facet of the new generation of data systems is that they are virtually all open source. Open source allows another possibility: data infrastructure could be unbundled into a collection of services and application-facing system apis. You already see this happening to a certain extent in the Java stack:

  • Zookeeper handles much of the system co-ordination (perhaps with a bit of help from higher-level abstractions like Helix or Curator).
  • Mesos and YARN do process virtualization and resource management
  • Embedded libraries like Lucene and LevelDB do indexing
  • Netty, Jetty and higher-level wrappers like Finagle and rest.li handle remote communication
  • Avro, Protocol Buffers, Thrift, and umpteen zillion other libraries handle serialization
  • Kafka and BookKeeper provide a backing log.

If you stack these things in a pile and squint a bit, it starts to look a bit like a lego version of distributed data system engineering. You can piece these ingredients together to create a vast array of possible systems. This is clearly not a story relevant to end-users who presumably care more about the API than how it is implemented, but it might be a path towards getting the simplicity of the single system in a more diverse and modular world that continues to evolve. If the implementation time for a distributed system goes from years to weeks because reliable, flexible building blocks emerge, then the pressure to coalesce into a single monolithic system disappears.

The place of the log in system architecture

A system that assumes an external log is present allows the individual systems to relinquish a lot of their own complexity and rely on the shared log. Here are the things I think a log can do:

  • Handle data consistency (whether eventual or immediate) by sequencing concurrent updates to nodes
  • Provide data replication between nodes
  • Provide "commit" semantics to the writer (i.e. acknowledging only when your write is guaranteed not to be lost)
  • Provide the external data subscription feed from the system
  • Provide the capability to restore failed replicas that lost their data or bootstrap new replicas
  • Handle rebalancing of data between nodes.

This is actually a substantial portion of what a distributed data system does. In fact, the majority of what is left over is related to the final client-facing query API and indexing strategy. This is exactly the part that should vary from system to system: for example, a full-text search query may need to query all partitions whereas a query by primary key may only need to query a single node responsible for that key's data.

Here is how this works. The system is divided into two logical pieces: the log and the serving layer. The log captures the state changes in sequential order. The serving nodes store whatever index is required to serve queries (for example a key-value store might have something like a btree or sstable, a search system would have an inverted index). Writes go directly to the log, though they may be proxied by the serving layer. Writing to the log yields a logical timestamp (say the index in the log). If the system is partitioned, and I assume it is, then the log and the serving nodes will have the same number of partitions, though they may have very different numbers of machines.

The serving nodes subscribe to the log and apply writes as quickly as possible to their local indexes in the order the log has stored them.

The client can get read-your-write semantics from any node by providing the timestamp of a write as part of its query—a serving node receiving such a query will compare the desired timestamp to its own index point and if necessary delay the request until it has indexed up to at least that time to avoid serving stale data.
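
Here is a compact sketch of both mechanisms on a single serving node: the log subscription applies writes to the local index in log order while tracking the node's index point, and a read that carries the logical timestamp of an earlier write blocks until the node has applied at least that far. This illustrates the idea, not any particular system's code.

    import java.util.HashMap;
    import java.util.Map;

    // One partition's serving node in a log-plus-serving-layer design.
    class ServingNode {
        private final Map<String, String> index = new HashMap<>();
        private long appliedOffset = -1; // the node's "index point" in the log

        // Invoked by the log subscription, strictly in log order.
        synchronized void apply(long offset, String key, String value) {
            index.put(key, value);
            appliedOffset = offset;
            notifyAll(); // wake any readers waiting for this offset
        }

        // Read-your-writes: the client passes the logical timestamp (log offset)
        // of its write; we delay until the local index has caught up that far.
        synchronized String get(String key, long minOffset) throws InterruptedException {
            while (appliedOffset < minOffset) {
                wait();
            }
            return index.get(key);
        }
    }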

The serving nodes may or may not need to have any notion of "mastership" or "leader election". For many simple use cases, the serving nodes can be completely without leaders, since the log is the source of truth.

One of the trickier things a distributed system must do is handle restoring failed nodes or moving partitions from node to node. A typical approach would have the log retain only a fixed window of data and combine this with a snapshot of the data stored in the partition. It is equally possible for the log to retain a complete copy of data and garbage collect the log itself. This moves a significant amount of complexity out of the serving layer, which is system-specific, and into the log, which can be general purpose.
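
A sketch of the first approach, with hypothetical interfaces: restoring a replica means bulk-loading the most recent snapshot and then replaying the retained window of the log from the snapshot's offset forward.

    // Restore = load a snapshot, then catch up from the log.
    class ReplicaRestore {
        interface Store {
            void put(String key, String value);
        }
        interface Snapshot {
            long offset();              // the log position the snapshot covers
            void loadInto(Store store); // bulk-load the snapshotted state
        }
        interface LogReader {
            void replayFrom(long offset, Store store); // apply newer entries in order
        }

        static void restore(Snapshot snapshot, LogReader log, Store store) {
            snapshot.loadInto(store);                 // 1. recover the bulk of the state
            log.replayFrom(snapshot.offset(), store); // 2. replay the tail of the log
        }
    }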

By having this log system, you get a fully developed subscription API for the contents of the data store which feeds ETL into other systems. In fact, many systems can share the same log while providing different indexes, like this:

Note how such a log-centric system is itself immediately a provider of data streams for processing and loading in other systems. Likewise, a stream processor can consume multiple input streams and then serve them via another system that indexes that output.

I find this view of systems as factored into a log and query API to be very revealing, as it lets you separate the query characteristics from the availability and consistency aspects of the system. I actually think this is even a useful way to mentally factor a system that isn't built this way to better understand it.

It's worth noting that although Kafka and BookKeeper are consistent logs, this is not a requirement. You could just as easily factor a Dynamo-like database into an eventually consistent AP log and a key-value serving layer. Such a log is a bit tricky to work with, as it will redeliver old messages and depends on the subscriber to handle this (much like Dynamo itself).

The idea of having a separate copy of data in the log (especially if it is a complete copy) strikes many people as wasteful. In reality, though, there are a few factors that make this less of an issue. First, the log can be a particularly efficient storage mechanism. We store over 75TB per datacenter on our production Kafka servers. Meanwhile, many serving systems require much more memory to serve data efficiently (text search, for example, is often all in memory). The serving system may also use optimized hardware. For example, most of our live data systems either serve out of memory or else use SSDs. In contrast, the log system does only linear reads and writes, so it is quite happy using large multi-TB hard drives. Finally, as in the picture above, in the case where the data is served by multiple systems, the cost of the log is amortized over multiple indexes. This combination makes the expense of an external log pretty minimal.

This is exactly the pattern that LinkedIn has used to build out many of its own real-time query systems. These systems feed off a database (using Databus as a log abstraction or off a dedicated log from Kafka) and provide a particular partitioning, indexing, and query capability on top of that data stream. This is the way we have implemented our search, social graph, and OLAP query systems. In fact, it is quite common to have a single data feed (whether a live feed or a derived feed coming from Hadoop) replicated into multiple serving systems for live serving. This has proven to be an enormous simplifying assumption. None of these systems need to have an externally accessible write API at all; Kafka and databases are used as the system of record and changes flow to the appropriate query systems through that log. Writes are handled locally by the nodes hosting a particular partition. These nodes blindly transcribe the feed provided by the log to their own store. A failed node can be restored by replaying the upstream log.

The degree to which these systems rely on the log varies. A fully reliant system could make use of the log for data partitioning, node restore, rebalancing, and all aspects of consistency and data propagation. In this setup, the serving tier is actually nothing less than a sort of "cache" structured to enable a particular type of processing with writes going directly to the log.

The End

If you made it this far you know most of what I know about logs.

Here are a few interesting references you may want to check out.

Everyone seems to use different terms for the same things so it is a bit of a puzzle to connect the database literature to the distributed systems stuff to the various enterprise software camps to the open source world. Nonetheless, here are a few pointers in the general direction.

Academic papers, systems, talks, and blogs:

  • A good overview of state machine and primary-backup replication
  • PacificA is a generic framework for implementing log-based distributed storage systems at Microsoft.
  • Spanner—Not everyone loves logical time for their logs. Google's new database tries to use physical time and models the uncertainty of clock drift directly by treating the timestamp as a range.
  • Datomic: Deconstructing the database is a great presentation by Rich Hickey, the creator of Clojure, on his startup's database product.
  • A Survey of Rollback-Recovery Protocols in Message-Passing Systems. I found this to be a very helpful introduction to fault-tolerance and the practical application of logs to recovery outside databases.
  • Reactive Manifesto—I'm actually not quite sure what is meant by reactive programming, but I think it means the same thing as "event driven". This link doesn't have much info, but this class by Martin Odersky (of Scala fame) looks fascinating.
  • Paxos!
    • Original paper is here. Leslie Lamport has an interesting history of how the algorithm was created in the 1980s but not published until 1998 because the reviewers didn't like the Greek parable in the paper and he didn't want to change it.
    • Even once the original paper was published it wasn't well understood. Lamport tries again and this time even includes a few of the "uninteresting details" of how to put it to use using these new-fangled automatic computers. It is still not widely understood.
    • Fred Schneider and Butler Lampson each give a more detailed overview of applying Paxos in real systems.
    • A few Google engineers summarize their experience implementing Paxos in Chubby.
    • I actually found all the Paxos papers pretty painful to understand but dutifully struggled through. But you don't need to because this video by John Ousterhout (of log-structured filesystem fame!) will make it all very simple. Somehow these consensus algorithms are much better presented by drawing them as the communication rounds unfold, rather than in a static presentation in a paper. Ironically, this video was created in an attempt to show that Paxos was hard to understand.
    • Using Paxos to Build a Scalable Consistent Data Store: This is a cool paper on using a log to build a data store by Jun, one of the co-authors, who is also one of the earliest engineers on Kafka.
  • Paxos has competitors! Actually each of these maps a lot more closely to the implementation of a log and is probably more suitable for practical implementation:
    • Viewstamped Replication by Barbara Liskov is an early algorithm to directly model log replication.
    • Zab is the algorithm used by Zookeeper.
    • RAFT is an attempt at a more understandable consensus algorithm. The video presentation, also by John Ousterhout, is great too.
  • You can see the role of the log in action in different real distributed databases.
    • PNUTS is a system which attempts to apply the log-centric design of traditional distributed databases at large scale.
    • HBase and Bigtable both give another example of logs in modern databases.
    • LinkedIn's own distributed database Espresso, like PNUTS, uses a log for replication, but takes a slightly different approach using the underlying table itself as the source of the log.
  • If you find yourself comparison shopping for a replication algorithm, this paper may help you out.
  • Replication: Theory and Practice is a great book that collects a bunch of summary papers on replication in distributed systems. Many of the chapters are online (e.g. 1, 4, 5, 6, 7, 8).
  • Stream processing. This is a bit too broad to summarize, but here are a few things I liked.
Enterprise software has all the same problems but with different names, a smaller scale, and XML. Ha ha, just kidding. Kind of.
  • Event Sourcing—As far as I can tell this is basically the enterprise software engineer's way of saying "state machine replication". It's interesting that the same idea would be invented again in such a different context. Event sourcing seems to focus on smaller, in-memory use cases. This approach to application development seems to combine the "stream processing" that occurs on the log of events with the application. Since this becomes pretty non-trivial when the processing is large enough to require data partitioning for scale, I focus on stream processing as a separate infrastructure primitive.
  • Change Data Capture—There is a small industry around getting data out of databases, and this is the most log-friendly style of data extraction.
  • Enterprise Application Integration seems to be about solving the data integration problem when what you have is a collection of off-the-shelf enterprise software like CRM or supply-chain management software.
  • Complex Event Processing (CEP): Fairly certain nobody knows what this means or how it actually differs from stream processing. The difference seems to be that the focus is on unordered streams and on event filtering and detection rather than aggregation, but this, in my opinion, is a distinction without a difference. I think any system that is good at one should be good at another.
  • Enterprise Service Bus—I think the enterprise service bus concept is very similar to some of the ideas I have described around data integration. This idea seems to have been moderately successful in enterprise software communities and is mostly unknown among web folks or the distributed data infrastructure crowd.
Interesting open source stuff:
  • Kafka is the "log as a service" project that is the basis for much of this post.
  • BookKeeper and Hedwig comprise another open source "log as a service". They seem to be more targeted at data system internals than at event data.
  • Databus is a system that provides a log-like overlay for database tables.
  • Akka is an actor framework for Scala. It has an add-on, eventsourced, that provides persistence and journaling.
  • Samza is a stream processing framework we are working on at LinkedIn. It uses a lot of the ideas in this article as well as integrating with Kafka as the underlying log.
  • Storm is a popular stream processing framework that integrates well with Kafka.
  • Spark Streaming is a stream processing framework that is part of Spark.
  • Summingbird is a layer on top of Storm or Hadoop that provides a convenient computing abstraction.

I try to keep up on this area so if you know of some things I've left out, let me know.

I leave you with this message:
