Parsing Expressions 解析表达式
Grammar, which knows how to control even kings. — Molière
语法,它甚至知道如何支配国王。——莫里哀
This chapter marks the first major milestone of the
book. Many of us have cobbled together a mishmash of regular expressions and
substring operations to extract some sense out of a pile of text. The code was
probably riddled with bugs and a beast to maintain. Writing a real parser—one with decent error handling, a coherent internal structure, and the ability
to robustly chew through a sophisticated syntax—is considered a rare,
impressive skill. In this chapter, you will attain
it.
本章标志着本书的第一个重要里程碑。我们中的许多人曾拼凑过一堆正则表达式和子字符串操作,试图从一堆文本中提取出有意义的信息。这些代码可能漏洞百出,维护起来如同驯服野兽。编写一个真正的解析器——具备良好的错误处理、一致的内部结构,以及能够稳健处理复杂语法的能力——被视为一项罕见且令人印象深刻的技能。在本章中,你将掌握这一技能。
It’s easier than you think, partially because we front-loaded a lot of the hard
work in the last chapter. You already know your way around a formal grammar.
You’re familiar with syntax trees, and we have some Java classes to represent
them. The only remaining piece is parsing—transmogrifying a sequence of
tokens into one of those syntax trees.
这比你想象的要容易,部分原因在于我们在上一章已经完成了许多繁重的工作。你已经掌握了形式语法的基本知识,熟悉了语法树,并且我们有一些 Java 类来表示它们。剩下的唯一部分就是解析——将一系列标记转换为这些语法树之一。
Some CS textbooks make a big deal out of parsers. In the ’60s, computer
scientists—understandably tired of programming in assembly language—started designing more sophisticated, human-friendly
languages like Fortran and ALGOL. Alas, they weren’t very machine-friendly
for the primitive computers of the time.
一些计算机科学教材对解析器大书特书。在 60 年代,计算机科学家们——显然厌倦了用汇编语言编程——开始设计更复杂、更人性化的语言,如 Fortran 和 ALGOL。可惜的是,这些语言对当时原始的计算机并不十分友好。
These pioneers designed languages that they honestly weren’t even sure how to
write compilers for, and then did groundbreaking work inventing parsing and
compiling techniques that could handle these new, big languages on those old, tiny
machines.
这些先驱者设计了他们甚至不确定如何编写编译器的语言,然后进行了开创性的工作,发明了能够在那些老旧的小型机器上处理这些新的大型语言的解析和编译技术。
Classic compiler books read like fawning hagiographies of these heroes and their
tools. The cover of Compilers: Principles, Techniques, and Tools literally has
a dragon labeled “complexity of compiler design” being slain by a knight bearing
a sword and shield branded “LALR parser generator” and “syntax directed
translation”. They laid it on thick.
经典的编译器书籍读起来就像是对这些英雄及其工具的阿谀奉承的圣徒传记。《编译器:原理、技术与工具》的封面上,确实有一条被标记为“编译器设计的复杂性”的龙,正被一位手持标有“LALR 解析器生成器”和“语法导向翻译”的剑与盾的骑士所斩杀。他们描绘得相当夸张。
A little self-congratulation is well-deserved, but the truth is you don’t need
to know most of that stuff to bang out a high quality parser for a modern
machine. As always, I encourage you to broaden your education and take it in
later, but this book omits the trophy case.
适度的自我祝贺是应得的,但事实上,你并不需要了解大部分内容就能为现代机器编写出高质量的解析器。一如既往,我鼓励你拓宽知识面并在之后深入学习,但本书略去了那些荣誉展示。
6.1 Ambiguity and the Parsing Game
歧义与解析游戏
In the last chapter, I said you can “play” a context-free grammar like a game in
order to generate strings. Parsers play that game in reverse. Given a string—a series of tokens—we map those tokens to terminals in the grammar to
figure out which rules could have generated that string.
在上一章中,我说过你可以像玩游戏一样“玩”一个上下文无关文法来生成字符串。解析器则逆向进行这个游戏。给定一个字符串——一系列标记——我们将这些标记映射到文法中的终结符,以找出哪些规则可能生成了该字符串。
The “could have” part is interesting. It’s entirely possible to create a grammar
that is ambiguous, where different choices of productions can lead to the same
string. When you’re using the grammar to generate strings, that doesn’t matter
much. Once you have the string, who cares how you got to it?
“可能”这部分很有趣。完全有可能创建一个歧义语法,其中不同的产生式选择可能导致相同的字符串。当你使用语法生成字符串时,这并不重要。一旦你有了字符串,谁在乎你是怎么得到的呢?
When parsing, ambiguity means the parser may misunderstand the user’s code. As
we parse, we aren’t just determining if the string is valid Lox code, we’re
also tracking which rules match which parts of it so that we know what part of
the language each token belongs to. Here’s the Lox expression grammar we put
together in the last chapter:
在解析时,歧义意味着解析器可能会误解用户的代码。当我们进行解析时,我们不仅仅是在确定字符串是否为有效的 Lox 代码,我们还在跟踪哪些规则匹配了它的哪些部分,以便我们知道每个标记属于语言的哪一部分。这是我们上一章整理的 Lox 表达式语法:
expression → literal
           | unary
           | binary
           | grouping ;
literal    → NUMBER | STRING | "true" | "false" | "nil" ;
grouping   → "(" expression ")" ;
unary      → ( "-" | "!" ) expression ;
binary     → expression operator expression ;
operator   → "==" | "!=" | "<" | "<=" | ">" | ">=" | "+" | "-" | "*" | "/" ;
This is a valid string in that grammar:
这是一个符合该语法的有效字符串:

6 / 3 - 1
But there are two ways we could have generated it. One way is:
但我们有两种生成方式。一种是:
- Starting at expression, pick binary.
  从 expression 开始,选择 binary。
- For the left-hand expression, pick NUMBER, and use 6.
  对于左侧的 expression,选择 NUMBER,并使用 6。
- For the operator, pick "/".
  对于运算符,选择 "/"。
- For the right-hand expression, pick binary again.
  对于右侧的 expression,再次选择 binary。
- In that nested binary expression, pick 3 - 1.
  在该嵌套的 binary 表达式中,选择 3 - 1。
Another is: 另一个是:
- Starting at expression, pick binary.
  从 expression 开始,选择 binary。
- For the left-hand expression, pick binary again.
  对于左侧的 expression,再次选择 binary。
- In that nested binary expression, pick 6 / 3.
  在该嵌套的 binary 表达式中,选择 6 / 3。
- Back at the outer binary, for the operator, pick "-".
  回到外层的 binary,对于运算符,选择 "-"。
- For the right-hand expression, pick NUMBER, and use 1.
  对于右侧的 expression,选择 NUMBER,并使用 1。
Those produce the same strings, but not the same syntax trees.
它们产生相同的字符串,但语法树并不相同。
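The figure showing the two trees isn't reproduced here. As a rough stand-in, here is a sketch (a hypothetical demo class, not part of jlox) that builds both shapes using the Expr and Token classes from earlier chapters. Evaluated the obvious way, the first tree gives 1 and the second gives 3, which is why the ambiguity matters.

package com.craftinginterpreters.lox;

// Sketch only: the two syntax trees the ambiguous grammar allows for "6 / 3 - 1".
class AmbiguityDemo {
  public static void main(String[] args) {
    Token slash = new Token(TokenType.SLASH, "/", null, 1);
    Token minus = new Token(TokenType.MINUS, "-", null, 1);

    // (6 / 3) - 1, which evaluates to 1.
    Expr first = new Expr.Binary(
        new Expr.Binary(new Expr.Literal(6.0), slash, new Expr.Literal(3.0)),
        minus,
        new Expr.Literal(1.0));

    // 6 / (3 - 1), which evaluates to 3.
    Expr second = new Expr.Binary(
        new Expr.Literal(6.0),
        slash,
        new Expr.Binary(new Expr.Literal(3.0), minus, new Expr.Literal(1.0)));

    System.out.println(new AstPrinter().print(first));   // (- (/ 6.0 3.0) 1.0)
    System.out.println(new AstPrinter().print(second));  // (/ 6.0 (- 3.0 1.0))
  }
}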
In other words, the grammar allows seeing the expression as (6 / 3) - 1
or 6 / (3 - 1)
. The binary
rule lets operands nest any which way you want. That in
turn affects the result of evaluating the parsed tree. The way mathematicians
have addressed this ambiguity since blackboards were first invented is by
defining rules for precedence and associativity.
换句话说,语法允许将表达式视为 (6 / 3) - 1
或 6 / (3 - 1)
。 binary
规则允许操作数以任何方式嵌套。这反过来又影响了解析树评估的结果。自黑板发明以来,数学家们解决这种歧义的方法是通过定义优先级和结合性的规则。
- Precedence determines which operator is evaluated first in an expression containing a mixture of different operators. Precedence rules tell us that we evaluate the / before the - in the above example. Operators with higher precedence are evaluated before operators with lower precedence. Equivalently, higher precedence operators are said to “bind tighter”.
  优先级决定了在包含不同运算符的表达式中,哪个运算符首先被求值。优先级规则告诉我们,在上面的例子中,我们先对 / 求值,再对 - 求值。具有较高优先级的运算符在较低优先级的运算符之前被求值。同样地,较高优先级的运算符被称为“绑定更紧密”。

- Associativity determines which operator is evaluated first in a series of the same operator. When an operator is left-associative (think “left-to-right”), operators on the left evaluate before those on the right. Since - is left-associative, this expression:
  结合性决定了在一系列相同运算符中哪个运算符首先被求值。当运算符是左结合(即“从左到右”)时,左边的运算符先于右边的运算符求值。由于 - 是左结合的,因此这个表达式:

  5 - 3 - 1

  is equivalent to: 相当于:

  (5 - 3) - 1

  Assignment, on the other hand, is right-associative. This:
  另一方面,赋值是右结合的。如下所示:

  a = b = c

  is equivalent to: 相当于:

  a = (b = c)
Without well-defined precedence and associativity, an expression that uses
multiple operators is ambiguous—it can be parsed into different syntax trees,
which could in turn evaluate to different results. We’ll fix that in Lox by
applying the same precedence rules as C, going from lowest to highest.
如果没有明确的优先级和结合性,使用多个运算符的表达式就会变得模糊不清——它可能被解析成不同的语法树,进而可能计算出不同的结果。我们将在 Lox 中通过应用与 C 语言相同的优先级规则来解决这个问题,从最低到最高。
Name 名称       | Operators | Associates
Equality 相等   | == !=     | Left 左
Comparison 比较 | > >= < <= | Left 左
Term            | - +       | Left 左
Factor 因子     | / *       | Left 左
Unary 一元      | ! -       | Right
Right now, the grammar stuffs all expression types into a single expression
rule. That same rule is used as the non-terminal for operands, which lets the
grammar accept any kind of expression as a subexpression, regardless of whether
the precedence rules allow it.
目前,语法将所有表达式类型都塞进了一个单一的 expression
规则中。该规则同样被用作操作数的非终结符,这使得语法能够接受任何类型的表达式作为子表达式,而不管优先级规则是否允许。
We fix that by stratifying the grammar. We define a
separate rule for each precedence level.
我们通过分层语法来解决这个问题。我们为每个优先级级别定义了一个单独的规则。
expression → ...
equality   → ...
comparison → ...
term       → ...
factor     → ...
unary      → ...
primary    → ...
Each rule here only matches expressions at its precedence level or higher. For
example, unary
matches a unary expression like !negated
or a primary
expression like 1234
. And term
can match 1 + 2
but also 3 * 4 / 5
. The
final primary
rule covers the highest-precedence forms—literals and
parenthesized expressions.
这里的每条规则仅匹配其优先级或更高优先级的表达式。例如, unary
可以匹配像 !negated
这样的一元表达式,或者像 1234
这样的基本表达式。而 term
可以匹配 1 + 2
,也可以匹配 3 * 4 / 5
。最后的 primary
规则涵盖了最高优先级的表达式形式——字面量和括号内的表达式。
We just need to fill in the productions for each of those rules. We’ll do the
easy ones first. The top expression
rule matches any expression at any
precedence level. Since equality
has the lowest
precedence, if we match that, then it covers everything.
我们只需要为每个规则填写产生式。我们先从简单的开始。顶部的 expression
规则匹配任何优先级下的任何表达式。由于 equality
具有最低的优先级,如果我们匹配它,那么它就覆盖了所有情况。
expression → equality ;
Over at the other end of the precedence table, a primary expression contains
all the literals and grouping expressions.
在优先级表的另一端,主表达式包含所有字面量和分组表达式。
primary → NUMBER | STRING | "true" | "false" | "nil" | "(" expression ")" ;
A unary expression starts with a unary operator followed by the operand. Since
unary operators can nest—!!true
is a valid if weird expression—the
operand can itself be a unary operator. A recursive rule handles that nicely.
一元表达式以一元运算符开头,后跟操作数。由于一元运算符可以嵌套—— !!true
是一个有效但奇怪的表达式——操作数本身也可以是一元运算符。递归规则很好地处理了这种情况。
unary → ( "!" | "-" ) unary ;
But this rule has a problem. It never terminates.
但这条规则有一个问题。它永远不会终止。
Remember, each rule needs to match expressions at that precedence level or
higher, so we also need to let this match a primary expression.
记住,每条规则都需要匹配该优先级或更高优先级的表达式,因此我们还需要让它匹配一个基本表达式。
unary → ( "!" | "-" ) unary | primary ;
That works. 那行得通。
The remaining rules are all binary operators. We’ll start with the rule for
multiplication and division. Here’s a first try:
剩下的规则都是二元运算符。我们将从乘法和除法的规则开始。这是第一次尝试:
factor → factor ( "/" | "*" ) unary | unary ;
The rule recurses to match the left operand. That enables the rule to match a
series of multiplication and division expressions like 1 * 2 / 3
. Putting the
recursive production on the left side and unary
on the right makes the rule
left-associative and unambiguous.
该规则递归地匹配左操作数。这使得规则能够匹配一系列乘法和除法表达式,如 1 * 2 / 3
。将递归产生式放在左侧, unary
放在右侧,使规则具有左结合性且无歧义。
All of this is correct, but the fact that the first symbol in the body of the
rule is the same as the head of the rule means this production is
left-recursive. Some parsing techniques, including the one we’re going to
use, have trouble with left recursion. (Recursion elsewhere, like we have in
unary
and the indirect recursion for grouping in primary
are not a problem.)
所有这些都是正确的,但规则体中的第一个符号与规则头相同,这意味着这个产生式是左递归的。一些解析技术,包括我们将要使用的技术,在处理左递归时会遇到困难。(其他地方的递归,如我们在 unary
中的递归以及 primary
中用于分组的间接递归,则没有问题。)
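To see concretely what the trouble is, here is a sketch (illustrative only, not part of jlox, and leaning on the parser helpers defined later in this chapter) of a naive recursive descent translation of the left-recursive rule. The very first thing the method does is call itself without having consumed a single token, so it can never make progress.

  // Illustrative only: naively translating "factor → factor ( "/" | "*" ) unary"
  // gives a method that immediately recurses with no progress, eventually
  // overflowing the stack. (It also has no way to choose the "| unary" alternative.)
  private Expr factorLeftRecursive() {
    Expr expr = factorLeftRecursive();   // left recursion: nothing consumed yet
    while (match(SLASH, STAR)) {
      Token operator = previous();
      Expr right = unary();
      expr = new Expr.Binary(expr, operator, right);
    }
    return expr;
  }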
There are many grammars you can define that match the same language. The choice
for how to model a particular language is partially a matter of taste and
partially a pragmatic one. This rule is correct, but not optimal for how we
intend to parse it. Instead of a left recursive rule, we’ll use a different one.
有许多语法可以定义来匹配同一种语言。如何为特定语言建模的选择部分取决于个人偏好,部分则是出于实用考虑。这条规则是正确的,但对于我们预期的解析方式来说并非最优。我们将使用另一条规则,而非左递归规则。
factor → unary ( ( "/" | "*" ) unary )* ;
We define a factor expression as a flat sequence of multiplications
and divisions. This matches the same syntax as the previous rule, but better
mirrors the code we’ll write to parse Lox. We use the same structure for all of
the other binary operator precedence levels, giving us this complete expression
grammar:
我们将因子表达式定义为一系列连续的乘法和除法运算。这与前一条规则的语法相匹配,但更好地反映了我们将编写的用于解析 Lox 的代码。我们对所有其他二元运算符优先级级别采用相同的结构,从而得到以下完整的表达式语法:
expression → equality ; equality → comparison ( ( "!=" | "==" ) comparison )* ; comparison → term ( ( ">" | ">=" | "<" | "<=" ) term )* ; term → factor ( ( "-" | "+" ) factor )* ; factor → unary ( ( "/" | "*" ) unary )* ; unary → ( "!" | "-" ) unary | primary ; primary → NUMBER | STRING | "true" | "false" | "nil" | "(" expression ")" ;
This grammar is more complex than the one we had before, but in return we have
eliminated the previous one’s ambiguity. It’s just what we need to make a
parser.
这个语法比我们之前的更复杂,但作为回报,我们消除了前一个语法的歧义。这正是我们构建解析器所需要的。
6.2 Recursive Descent Parsing
递归下降解析
There is a whole pack of parsing techniques whose names are mostly combinations
of “L” and “R”—LL(k), LR(1), LALR—along with more exotic
beasts like parser combinators, Earley parsers, the shunting yard
algorithm, and packrat parsing. For our first interpreter, one
technique is more than sufficient: recursive descent.
有一整套解析技术,其名称大多是“L”和“R”的组合——LL(k)、LR(1)、LALR——以及更奇特的工具,如解析器组合子、Earley 解析器、调度场算法和 packrat 解析。对于我们的第一个解释器,一种技术就足够了:递归下降。
Recursive descent is the simplest way to build a parser, and doesn’t require
using complex parser generator tools like Yacc, Bison or ANTLR. All you need is
straightforward handwritten code. Don’t be fooled by its simplicity, though.
Recursive descent parsers are fast, robust, and can support sophisticated
error handling. In fact, GCC, V8 (the JavaScript VM in Chrome), Roslyn (the C#
compiler written in C#) and many other heavyweight production language
implementations use recursive descent. It rocks.
递归下降是构建解析器的最简单方法,且无需使用如 Yacc、Bison 或 ANTLR 等复杂的解析器生成工具。你所需要的只是直接手写的代码。不过,别被它的简单所迷惑。递归下降解析器速度快、健壮性强,并能支持复杂的错误处理。实际上,GCC、V8(Chrome 中的 JavaScript 虚拟机)、Roslyn(用 C#编写的 C#编译器)以及许多其他重量级生产语言实现都采用了递归下降。它真的很棒。
Recursive descent is considered a top-down parser because it starts from the
top or outermost grammar rule (here expression
) and works its way down into the nested subexpressions before finally
reaching the leaves of the syntax tree. This is in contrast with bottom-up
parsers like LR that start with primary expressions and compose them into larger
and larger chunks of syntax.
递归下降被认为是一种自顶向下的解析器,因为它从顶部或最外层的语法规则(此处为 expression
)开始,逐步深入到嵌套的子表达式中,最终到达语法树的叶子节点。这与自底向上的解析器(如 LR 解析器)形成对比,后者从基本表达式开始,逐步组合成越来越大的语法块。
A recursive descent parser is a literal translation of the grammar’s rules
straight into imperative code. Each rule becomes a function. The body of the
rule translates to code roughly like:
递归下降解析器是将语法规则直接转换为命令式代码的字面翻译。每个规则变成一个函数。规则的主体大致翻译为如下代码:
Grammar notation 语法符号 | Code representation 代码表示
Terminal                  | Code to match and consume a token 匹配并消费令牌的代码
Nonterminal               | Call to that rule's function 调用该规则的函数
|                         | if or switch statement if 或 switch 语句
* or + * 或 +             | while or for loop while 或 for 循环
?                         | if statement
The descent is described as “recursive” because when a grammar rule refers to
itself—directly or indirectly—that translates to a recursive function
call.
下降被称为“递归”是因为当语法规则直接或间接引用自身时,这转化为递归函数调用。
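The expression grammar in this chapter happens not to use ?, but that last row deserves an example too. A hypothetical rule like returnStmt → "return" expression? ";" (statements and the Stmt classes only arrive in later chapters) would translate to an if around the optional part, roughly like this sketch built on the match(), check(), and consume() helpers defined below:

  // Hypothetical sketch -- Lox statements come later in the book.
  // returnStmt → "return" expression? ";"
  private Stmt returnStatement() {
    Token keyword = previous();          // the "return" keyword we just matched
    Expr value = null;
    if (!check(SEMICOLON)) {             // the "?" in the grammar becomes an if
      value = expression();
    }
    consume(SEMICOLON, "Expect ';' after return value.");
    return new Stmt.Return(keyword, value);
  }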
6.2.1 The parser class 解析器类
Each grammar rule becomes a method inside this new class:
每个语法规则都成为这个新类中的一个方法:
create new file 创建新文件
package com.craftinginterpreters.lox;

import java.util.List;

import static com.craftinginterpreters.lox.TokenType.*;

class Parser {
  private final List<Token> tokens;
  private int current = 0;

  Parser(List<Token> tokens) {
    this.tokens = tokens;
  }
}
Like the scanner, the parser consumes a flat input sequence, only now we’re
reading tokens instead of characters. We store the list of tokens and use
current
to point to the next token eagerly waiting to be parsed.
与扫描器类似,解析器也消耗一个扁平的输入序列,只不过现在我们读取的是标记而非字符。我们存储标记列表,并使用 current
来指向急切等待被解析的下一个标记。
We’re going to run straight through the expression grammar now and translate
each rule to Java code. The first rule, expression
, simply expands to the
equality
rule, so that’s straightforward.
我们现在将直接运行表达式语法,并将每个规则翻译为 Java 代码。第一个规则 expression
简单地扩展为 equality
规则,因此这很简单。
add after Parser() 在 Parser()后添加
private Expr expression() {
  return equality();
}
Each method for parsing a grammar rule produces a syntax tree for that rule and
returns it to the caller. When the body of the rule contains a nonterminal—a
reference to another rule—we call that other rule’s
method.
每种解析语法规则的方法都会为该规则生成一个语法树,并将其返回给调用者。当规则的主体包含一个非终结符——即对另一个规则的引用时,我们会调用该规则的方法。
The rule for equality is a little more complex.
相等性的规则稍微复杂一些。
equality → comparison ( ( "!=" | "==" ) comparison )* ;
In Java, that becomes:
在 Java 中,这变为:
add after expression() 在 expression()后添加
private Expr equality() {
  Expr expr = comparison();

  while (match(BANG_EQUAL, EQUAL_EQUAL)) {
    Token operator = previous();
    Expr right = comparison();
    expr = new Expr.Binary(expr, operator, right);
  }

  return expr;
}
Let’s step through it. The first comparison
nonterminal in the body translates
to the first call to comparison()
in the method. We take that result and store
it in a local variable.
让我们逐步分析。正文中的第一个 comparison
非终结符转换为方法中对 comparison()
的第一次调用。我们获取该结果并将其存储在局部变量中。
Then, the ( ... )*
loop in the rule maps to a while
loop. We need to know
when to exit that loop. We can see that inside the rule, we must first find
either a !=
or ==
token. So, if we don’t see one of those, we must be done
with the sequence of equality operators. We express that check using a handy
match()
method.
然后,规则中的 ( ... )*
循环映射到 while
循环。我们需要知道何时退出该循环。我们可以看到,在规则内部,我们必须首先找到 !=
或 ==
标记。因此,如果我们没有看到其中一个,那么我们必须已经完成了相等运算符的序列。我们使用一个方便的 match()
方法来表示该检查。
add after equality() 在 equality()后添加
private boolean match(TokenType... types) {
  for (TokenType type : types) {
    if (check(type)) {
      advance();
      return true;
    }
  }

  return false;
}
This checks to see if the current token has any of the given types. If so, it
consumes the token and returns true
. Otherwise, it returns false
and leaves
the current token alone. The match()
method is defined in terms of two more
fundamental operations.
此操作检查当前令牌是否具有任何给定类型。如果是,则消耗该令牌并返回 true
。否则,返回 false
并保持当前令牌不变。 match()
方法基于两个更基本的操作定义。
The check()
method returns true
if the current token is of the given type.
Unlike match()
, it never consumes the token, it only looks at it.
check()
方法在当前标记为给定类型时返回 true
。与 match()
不同,它从不消耗标记,仅查看它。
add after match() 在 match() 后添加
private boolean check(TokenType type) {
  if (isAtEnd()) return false;
  return peek().type == type;
}
The advance()
method consumes the current token and returns it, similar to how
our scanner’s corresponding method crawled through characters.
advance()
方法消耗当前标记并返回它,类似于我们扫描器对应方法遍历字符的方式。
add after check() 在 check()后添加
private Token advance() {
  if (!isAtEnd()) current++;
  return previous();
}
These methods bottom out on the last handful of primitive operations.
这些方法最终归结为最后几个基本操作。
add after advance() 在 advance()后添加
private boolean isAtEnd() {
  return peek().type == EOF;
}

private Token peek() {
  return tokens.get(current);
}

private Token previous() {
  return tokens.get(current - 1);
}
isAtEnd()
checks if we’ve run out of tokens to parse. peek()
returns the
current token we have yet to consume, and previous()
returns the most recently
consumed token. The latter makes it easier to use match()
and then access the
just-matched token.
isAtEnd()
检查我们是否已经用完要解析的标记。 peek()
返回我们尚未消耗的当前标记, previous()
返回最近消耗的标记。后者使得使用 match()
后访问刚刚匹配的标记更加方便。
That’s most of the parsing infrastructure we need. Where were we? Right, so if
we are inside the while
loop in equality()
, then we know we have found a
!=
or ==
operator and must be parsing an equality expression.
这就是我们所需的大部分解析基础设施。我们说到哪儿了?对了,如果我们在 equality()
中的 while
循环内,那么我们知道已经找到了 !=
或 ==
运算符,并且必须解析一个相等表达式。
We grab the matched operator token so we can track which kind of equality
expression we have. Then we call comparison()
again to parse the right-hand
operand. We combine the operator and its two operands into a new Expr.Binary
syntax tree node, and then loop around. For each iteration, we store the
resulting expression back in the same expr
local variable. As we zip through a
sequence of equality expressions, that creates a left-associative nested tree of
binary operator nodes.
我们获取匹配的运算符标记,以便跟踪所处理的等式表达式类型。接着,再次调用 comparison()
来解析右侧操作数。将运算符及其两个操作数组合成一个新的 Expr.Binary
语法树节点,然后循环处理。每次迭代时,我们将结果表达式存储回同一个 expr
局部变量中。当快速遍历一系列等式表达式时,这便构建了一个左结合嵌套的二元运算符节点树。
The parser falls out of the loop once it hits a token that’s not an equality
operator. Finally, it returns the expression. Note that if the parser never
encounters an equality operator, then it never enters the loop. In that case,
the equality()
method effectively calls and returns comparison()
. In that
way, this method matches an equality operator or anything of higher
precedence.
解析器一旦遇到非等值运算符的标记,就会退出循环。最后,它返回表达式。请注意,如果解析器从未遇到等值运算符,则它永远不会进入循环。在这种情况下,equality() 方法实际上会调用并返回 comparison()。通过这种方式,该方法匹配等值运算符或任何更高优先级的表达式。
Moving on to the next rule . . .
继续下一个规则...
comparison → term ( ( ">" | ">=" | "<" | "<=" ) term )* ;
Translated to Java: 翻译为 Java:
add after equality() 在 equality()后添加
private Expr comparison() {
  Expr expr = term();

  while (match(GREATER, GREATER_EQUAL, LESS, LESS_EQUAL)) {
    Token operator = previous();
    Expr right = term();
    expr = new Expr.Binary(expr, operator, right);
  }

  return expr;
}
The grammar rule is virtually identical to equality
and so is the corresponding code. The only differences are the token types for
the operators we match, and the method we call for the operands—now
term()
instead of comparison()
. The remaining two binary operator rules
follow the same pattern.
语法规则与 equality
几乎相同,对应的代码也是如此。唯一的区别在于我们匹配的操作符的标记类型,以及我们为操作数调用的方法——现在是 term()
而不是 comparison()
。剩下的两个二元操作符规则遵循相同的模式。
In order of precedence, first addition and subtraction:
按优先级顺序,先进行加减法:
add after comparison() 在 comparison()后添加
private Expr term() {
  Expr expr = factor();

  while (match(MINUS, PLUS)) {
    Token operator = previous();
    Expr right = factor();
    expr = new Expr.Binary(expr, operator, right);
  }

  return expr;
}
And finally, multiplication and division:
最后,乘法和除法:
add after term() 在 term()后添加
private Expr factor() {
  Expr expr = unary();

  while (match(SLASH, STAR)) {
    Token operator = previous();
    Expr right = unary();
    expr = new Expr.Binary(expr, operator, right);
  }

  return expr;
}
That’s all of the binary operators, parsed with the correct precedence and
associativity. We’re crawling up the precedence hierarchy and now we’ve reached
the unary operators.
这就是所有的二元运算符,按照正确的优先级和结合性进行解析。我们正在爬升优先级层次结构,现在我们已经到达了一元运算符。
unary → ( "!" | "-" ) unary | primary ;
The code for this is a little different.
此代码略有不同。
add after factor() 在 factor()后添加
private Expr unary() {
  if (match(BANG, MINUS)) {
    Token operator = previous();
    Expr right = unary();
    return new Expr.Unary(operator, right);
  }

  return primary();
}
Again, we look at the current token to see how to
parse. If it’s a !
or -
, we must have a unary expression. In that case, we
grab the token and then recursively call unary()
again to parse the operand.
Wrap that all up in a unary expression syntax tree and we’re done.
再次,我们查看当前标记以确定如何解析。如果它是 !
或 -
,我们必须有一个一元表达式。在这种情况下,我们获取该标记,然后递归调用 unary()
来解析操作数。将所有内容包装在一元表达式语法树中,我们就完成了。
Otherwise, we must have reached the highest level of precedence, primary
expressions.
否则,我们必须已经达到了最高优先级的表达式,即基本表达式。
primary → NUMBER | STRING | "true" | "false" | "nil" | "(" expression ")" ;
Most of the cases for the rule are single terminals, so parsing is
straightforward.
大多数规则的案例都是单一终端,因此解析是直接的。
add after unary() 在 unary()后添加
private Expr primary() {
  if (match(FALSE)) return new Expr.Literal(false);
  if (match(TRUE)) return new Expr.Literal(true);
  if (match(NIL)) return new Expr.Literal(null);

  if (match(NUMBER, STRING)) {
    return new Expr.Literal(previous().literal);
  }

  if (match(LEFT_PAREN)) {
    Expr expr = expression();
    consume(RIGHT_PAREN, "Expect ')' after expression.");
    return new Expr.Grouping(expr);
  }
}
The interesting branch is the one for handling parentheses. After we match an
opening (
and parse the expression inside it, we must find a )
token. If
we don’t, that’s an error.
有趣的分支是处理括号的部分。当我们匹配到一个开头的 (
并解析其中的表达式后,必须找到一个 )
标记。如果没有找到,那就是一个错误。
6.3 Syntax Errors 语法错误
A parser really has two jobs:
解析器实际上有两个任务:
- Given a valid sequence of tokens, produce a corresponding syntax tree.
  给定一个有效的标记序列,生成相应的语法树。

- Given an invalid sequence of tokens, detect any errors and tell the user about their mistakes.
  给定一个无效的令牌序列,检测任何错误并告知用户其错误。
Don’t underestimate how important the second job is! In modern IDEs and editors,
the parser is constantly reparsing code—often while the user is still editing
it—in order to syntax highlight and support things like auto-complete. That
means it will encounter code in incomplete, half-wrong states all the time.
不要低估第二项工作的重要性!在现代集成开发环境(IDE)和编辑器中,解析器会不断地重新解析代码——通常是在用户仍在编辑时——以便进行语法高亮和支持自动完成等功能。这意味着它将经常遇到不完整、半错误状态的代码。
When the user doesn’t realize the syntax is wrong, it is up to the parser to
help guide them back onto the right path. The way it reports errors is a large
part of your language’s user interface. Good syntax error handling is hard. By
definition, the code isn’t in a well-defined state, so there’s no infallible way
to know what the user meant to write. The parser can’t read your mind.
当用户未意识到语法错误时,解析器有责任引导他们回到正确的路径。错误报告的方式是您语言用户界面的重要组成部分。良好的语法错误处理是困难的。根据定义,代码未处于明确定义的状态,因此没有绝对可靠的方法来了解用户想要编写的内容。解析器无法读取您的思维。
There are a couple of hard requirements for when the parser runs into a syntax
error. A parser must:
解析器在遇到语法错误时有一些硬性要求。解析器必须:
- Detect and report the error. If it doesn't detect the error and passes the resulting malformed syntax tree on to the interpreter, all manner of horrors may be summoned.
  检测并报告错误。如果未能检测到错误并将生成的有缺陷的语法树传递给解释器,可能会引发各种可怕的问题。

- Avoid crashing or hanging. Syntax errors are a fact of life, and language tools have to be robust in the face of them. Segfaulting or getting stuck in an infinite loop isn't allowed. While the source may not be valid code, it's still a valid input to the parser because users use the parser to learn what syntax is allowed.
  避免崩溃或挂起。语法错误是不可避免的,语言工具在面对它们时必须具备鲁棒性。不允许出现段错误或陷入无限循环的情况。虽然源代码可能不是有效的代码,但它仍然是解析器的有效输入,因为用户使用解析器来了解允许的语法。
Those are the table stakes if you want to get in the parser game at all, but you
really want to raise the ante beyond that. A decent parser should:
这些是进入解析器游戏的基本要求,但你确实需要在此基础上提高赌注。一个像样的解析器应该:
- Be fast. Computers are thousands of times faster than they were when parser technology was first invented. The days of needing to optimize your parser so that it could get through an entire source file during a coffee break are over. But programmer expectations have risen as quickly, if not faster. They expect their editors to reparse files in milliseconds after every keystroke.
  要快。计算机的速度比解析器技术刚发明时快了几千倍。那种需要优化解析器以便在咖啡休息时间内完成整个源文件解析的日子已经一去不复返了。但程序员的期望也以同样快的速度,甚至更快的速度上升。他们期望编辑器在每次按键后能在几毫秒内重新解析文件。

- Report as many distinct errors as there are. Aborting after the first error is easy to implement, but it's annoying for users if every time they fix what they think is the one error in a file, a new one appears. They want to see them all.
  报告尽可能多的不同错误。在第一个错误后中止很容易实现,但如果用户每次修复他们认为文件中唯一的错误时,又出现新的错误,这会让他们感到烦恼。他们希望看到所有错误。

- Minimize cascaded errors. Once a single error is found, the parser no longer really knows what's going on. It tries to get itself back on track and keep going, but if it gets confused, it may report a slew of ghost errors that don't indicate other real problems in the code. When the first error is fixed, those phantoms disappear, because they reflect only the parser's own confusion. Cascaded errors are annoying because they can scare the user into thinking their code is in a worse state than it is.
  最小化级联错误。一旦发现单个错误,解析器实际上就不再知道发生了什么。它试图让自己回到正轨并继续运行,但如果它感到困惑,可能会报告一连串的幽灵错误,这些错误并不表示代码中的其他实际问题。当第一个错误被修复时,这些幻影就会消失,因为它们仅反映了解析器自身的困惑。级联错误令人烦恼,因为它们可能会吓到用户,让他们认为自己的代码状态比实际情况更糟。
The last two points are in tension. We want to report as many separate errors as
we can, but we don’t want to report ones that are merely side effects of an
earlier one.
最后两点存在矛盾。我们希望尽可能多地报告独立的错误,但又不希望报告那些仅仅是早期错误副作用的错误。
The way a parser responds to an error and keeps going to look for later errors
is called error recovery. This was a hot research topic in the ’60s. Back
then, you’d hand a stack of punch cards to the secretary and come back the next
day to see if the compiler succeeded. With an iteration loop that slow, you
really wanted to find every single error in your code in one pass.
解析器对错误作出响应并继续寻找后续错误的方式称为错误恢复。这是 20 世纪 60 年代的一个热门研究课题。那时候,你会把一叠穿孔卡片交给秘书,第二天再回来查看编译器是否成功。由于迭代循环如此缓慢,你真的希望在一次编译中找出代码中的所有错误。
Today, when parsers complete before you’ve even finished typing, it’s less of an
issue. Simple, fast error recovery is fine.
如今,解析器在你输入完成之前就已结束,这已不再是问题。简单快速的错误恢复就足够了。
6.3.1 Panic mode error recovery
Panic 模式错误恢复
Of all the recovery techniques devised in yesteryear, the one that best stood
the test of time is called—somewhat alarmingly—panic
mode. As soon as the parser detects an error, it enters panic mode. It
knows at least one token doesn’t make sense given its current state in the
middle of some stack of grammar productions.
在昔日设计的所有恢复技术中,最能经受时间考验的是一种听起来有些令人不安的——恐慌模式。一旦解析器检测到错误,它就会进入恐慌模式。它知道至少有一个标记在当前语法产生式堆栈的中间状态下是没有意义的。
Before it can get back to parsing, it needs to get its state and the sequence of
forthcoming tokens aligned such that the next token does match the rule being
parsed. This process is called synchronization.
在恢复解析之前,它需要将其状态与即将到来的标记序列对齐,以确保下一个标记确实匹配正在解析的规则。这一过程称为同步。
To do that, we select some rule in the grammar that will mark the
synchronization point. The parser fixes its parsing state by jumping out of any
nested productions until it gets back to that rule. Then it synchronizes the
token stream by discarding tokens until it reaches one that can appear at that
point in the rule.
为此,我们在语法中选择一些规则来标记同步点。解析器通过跳出任何嵌套的产生式来固定其解析状态,直到返回到该规则。然后,它通过丢弃标记来同步标记流,直到到达可以在该规则点出现的标记。
Any additional real syntax errors hiding in those discarded tokens aren’t
reported, but it also means that any mistaken cascaded errors that are side
effects of the initial error aren’t falsely reported either, which is a decent
trade-off.
任何隐藏在那些被丢弃的标记中的额外真实语法错误都不会被报告,但这也意味着任何由初始错误引发的级联错误也不会被错误地报告,这是一个不错的权衡。
The traditional place in the grammar to synchronize is between statements. We
don’t have those yet, so we won’t actually synchronize in this chapter, but
we’ll get the machinery in place for later.
语法中传统的同步位置是在语句之间。我们目前还没有这些内容,因此本章实际上不会进行同步操作,但我们会为后续章节搭建好机制。
6.3.2 Entering panic mode 进入恐慌模式
Back before we went on this side trip around error recovery, we were writing the
code to parse a parenthesized expression. After parsing the expression, the
parser looks for the closing )
by calling consume()
. Here, finally, is that
method:
在我们绕道讨论错误恢复之前,我们正在编写解析带括号表达式的代码。解析完表达式后,解析器通过调用 consume()
来查找结束的 )
。最后,这就是那个方法:
add after match() 在 match() 后添加
private Token consume(TokenType type, String message) {
  if (check(type)) return advance();

  throw error(peek(), message);
}
It’s similar to match()
in that it checks to see if the next token is of the
expected type. If so, it consumes the token and everything is groovy. If some
other token is there, then we’ve hit an error. We report it by calling this:
它与 match()
类似,都会检查下一个标记是否为预期类型。如果是,则消耗该标记,一切顺利。如果出现其他标记,则意味着遇到了错误。我们通过调用以下内容来报告错误:
add after previous() 在 previous() 后添加
private ParseError error(Token token, String message) {
  Lox.error(token, message);
  return new ParseError();
}
First, that shows the error to the user by calling:
首先,通过调用以下代码向用户显示错误:
add after report() 在 report()后添加
static void error(Token token, String message) {
  if (token.type == TokenType.EOF) {
    report(token.line, " at end", message);
  } else {
    report(token.line, " at '" + token.lexeme + "'", message);
  }
}
This reports an error at a given token. It shows the token’s location and the
token itself. This will come in handy later since we use tokens throughout the
interpreter to track locations in code.
此函数报告给定令牌处的错误。它显示令牌的位置和令牌本身。由于我们在整个解释器中使用令牌来跟踪代码中的位置,这在以后会派上用场。
After we report the error, the user knows about their mistake, but what does the
parser do next? Back in error()
, we create and return a ParseError, an
instance of this new class:
在我们报告错误后,用户知道了他们的错误,但解析器接下来会做什么呢?回到 error()
,我们创建并返回一个 ParseError,这是新类的一个实例:
class Parser {
nest inside class Parser 嵌套在类 Parser 内部
  private static class ParseError extends RuntimeException {}

  private final List<Token> tokens;
This is a simple sentinel class we use to unwind the parser. The error()
method returns the error instead of throwing it because we want to let the
calling method inside the parser decide whether to unwind or not. Some parse
errors occur in places where the parser isn’t likely to get into a weird state
and we don’t need to synchronize. In those
places, we simply report the error and keep on truckin’.
这是一个简单的哨兵类,我们用它来展开解析器。 error()
方法返回错误而不是抛出它,因为我们想让解析器内部的调用方法决定是否展开。有些解析错误发生在解析器不太可能进入奇怪状态的地方,我们不需要同步。在这些地方,我们只需报告错误并继续前进。
For example, Lox limits the number of arguments you can pass to a function. If
you pass too many, the parser needs to report that error, but it can and should
simply keep on parsing the extra arguments instead of freaking out and going
into panic mode.
例如,Lox 限制了可以传递给函数的参数数量。如果传递了太多参数,解析器需要报告该错误,但它可以而且应该继续解析额外的参数,而不是惊慌失措并进入恐慌模式。
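As a preview of what that looks like (this check actually appears when we parse function calls in a later chapter, so treat it as a forward reference rather than code to add now), the parser calls error() for its reporting side effect and simply ignores the returned exception instead of throwing it:

  if (arguments.size() >= 255) {
    error(peek(), "Can't have more than 255 arguments.");
  }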
In our case, though, the syntax error is nasty enough that we want to panic and
synchronize. Discarding tokens is pretty easy, but how do we synchronize the
parser’s own state?
不过,在我们的例子中,语法错误严重到足以让我们想要报错并同步。丢弃标记相当容易,但我们如何同步解析器自身的状态呢?
6.3.3 Synchronizing a recursive descent parser
同步递归下降解析器
With recursive descent, the parser’s state—which rules it is in the middle of
recognizing—is not stored explicitly in fields. Instead, we use Java’s
own call stack to track what the parser is doing. Each rule in the middle of
being parsed is a call frame on the stack. In order to reset that state, we need
to clear out those call frames.
使用递归下降法时,解析器的状态——即它正在识别哪些规则——并不显式地存储在字段中。相反,我们利用 Java 自身的调用栈来跟踪解析器的操作。每个正在解析的规则都是栈上的一个调用帧。为了重置该状态,我们需要清除这些调用帧。
The natural way to do that in Java is exceptions. When we want to synchronize,
we throw that ParseError object. Higher up in the method for the grammar rule
we are synchronizing to, we’ll catch it. Since we synchronize on statement
boundaries, we’ll catch the exception there. After the exception is caught, the
parser is in the right state. All that’s left is to synchronize the tokens.
在 Java 中实现这一点的自然方式是使用异常。当我们想要同步时,我们抛出那个 ParseError 对象。在我们同步到的语法规则的方法中更高层的位置,我们会捕获它。由于我们在语句边界上同步,我们会在那里捕获异常。捕获异常后,解析器处于正确的状态。剩下的就是同步标记了。
We want to discard tokens until we’re right at the beginning of the next
statement. That boundary is pretty easy to spot—it’s one of the main reasons
we picked it. After a semicolon, we’re probably
finished with a statement. Most statements start with a keyword—for
, if
,
return
, var
, etc. When the next token is any of those, we’re probably
about to start a statement.
我们希望丢弃标记,直到我们正好处于下一条语句的开头。这个边界很容易识别——这是我们选择它的主要原因之一。在分号之后,我们可能已经完成了一条语句。大多数语句以关键字开头—— for
、 if
、 return
、 var
等。当下一个标记是这些关键字之一时,我们可能即将开始一条语句。
This method encapsulates that logic:
该方法封装了该逻辑:
add after error() 在 error()后添加
private void synchronize() {
  advance();

  while (!isAtEnd()) {
    if (previous().type == SEMICOLON) return;

    switch (peek().type) {
      case CLASS:
      case FUN:
      case VAR:
      case FOR:
      case IF:
      case WHILE:
      case PRINT:
      case RETURN:
        return;
    }

    advance();
  }
}
It discards tokens until it thinks it has found a statement boundary. After
catching a ParseError, we’ll call this and then we are hopefully back in sync.
When it works well, we have discarded tokens that would have likely caused
cascaded errors anyway, and now we can parse the rest of the file starting at
the next statement.
它会丢弃标记,直到认为找到了语句边界。在捕获到 ParseError 后,我们将调用此方法,希望此时能重新同步。当它运作良好时,我们已经丢弃了那些可能导致级联错误的标记,现在可以从下一个语句开始解析文件的其余部分。
Alas, we don’t get to see this method in action, since we don’t have statements
yet. We’ll get to that in a couple of chapters. For now, if an
error occurs, we’ll panic and unwind all the way to the top and stop parsing.
Since we can parse only a single expression anyway, that’s no big loss.
可惜的是,由于我们还没有语句,所以无法看到这个方法的具体应用。我们将在接下来的几章中讨论这个问题。目前,如果出现错误,我们会直接 panic 并回退到最上层,停止解析。反正我们只能解析单个表达式,所以这也没什么大不了的。
6.4 Wiring up the Parser 连接解析器
We are mostly done parsing expressions now. There is one other place where we
need to add a little error handling. As the parser descends through the parsing
methods for each grammar rule, it eventually hits primary()
. If none of the
cases in there match, it means we are sitting on a token that can’t start an
expression. We need to handle that error too.
我们现在基本完成了表达式的解析。还有一个地方需要添加一些错误处理。当解析器通过每个语法规则的解析方法下降时,它最终会到达 primary()
。如果其中的所有情况都不匹配,这意味着我们遇到了一个不能作为表达式开头的标记。我们也需要处理这个错误。
if (match(LEFT_PAREN)) {
  Expr expr = expression();
  consume(RIGHT_PAREN, "Expect ')' after expression.");
  return new Expr.Grouping(expr);
}
in primary() 在 primary()中
throw error(peek(), "Expect expression.");
}
With that, all that remains in the parser is to define an initial method to kick
it off. That method is called, naturally enough, parse()
.
至此,解析器中剩下的就是定义一个初始方法来启动它。这个方法自然被称为 parse()
。
add after Parser() 在 Parser()后添加
Expr parse() {
  try {
    return expression();
  } catch (ParseError error) {
    return null;
  }
}
We’ll revisit this method later when we add statements to the language. For now,
it parses a single expression and returns it. We also have some temporary code
to exit out of panic mode. Syntax error recovery is the parser’s job, so we
don’t want the ParseError exception to escape into the rest of the interpreter.
我们稍后向语言中添加语句时会重新讨论这个方法。目前,它只解析单个表达式并返回它。我们还有一些临时代码用于退出恐慌模式。语法错误恢复是解析器的职责,因此我们不希望 ParseError 异常逃逸到解释器的其他部分。
When a syntax error does occur, this method returns null
. That’s OK. The
parser promises not to crash or hang on invalid syntax, but it doesn’t promise
to return a usable syntax tree if an error is found. As soon as the parser
reports an error, hadError
gets set, and subsequent phases are skipped.
当发生语法错误时,此方法返回 null
。这是正常的。解析器承诺不会因无效语法而崩溃或挂起,但它不承诺在发现错误时返回可用的语法树。一旦解析器报告错误, hadError
就会被设置,后续阶段将被跳过。
Finally, we can hook up our brand new parser to the main Lox class and try it
out. We still don’t have an interpreter, so for now, we’ll parse to a syntax
tree and then use the AstPrinter class from the last chapter to
display it.
最后,我们可以将全新的解析器连接到主 Lox 类并进行测试。由于我们还没有解释器,所以目前我们将解析为语法树,然后使用上一章的 AstPrinter 类来显示它。
Delete the old code to print the scanned tokens and replace it with this:
删除用于打印扫描标记的旧代码,并将其替换为以下内容:
List<Token> tokens = scanner.scanTokens();
in run() 在 run() 中
replace 5 lines 替换 5 行
Parser parser = new Parser(tokens);
Expr expression = parser.parse();

// Stop if there was a syntax error.
if (hadError) return;

System.out.println(new AstPrinter().print(expression));
}
Congratulations, you have crossed the threshold! That
really is all there is to handwriting a parser. We’ll extend the grammar in
later chapters with assignment, statements, and other stuff, but none of that is
any more complex than the binary operators we tackled here.
恭喜你,已经跨过了门槛!手写解析器其实就这么简单。在后续章节中,我们将扩展语法,加入赋值、语句等内容,但这些都不会比我们在这里处理的二元运算符更复杂。
Fire up the interpreter and type in some expressions. See how it handles
precedence and associativity correctly? Not bad for less than 200 lines of code.
启动解释器并输入一些表达式。看看它是如何正确处理优先级和结合性的?对于不到 200 行代码来说,这还不错。
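If you prefer not to type into the REPL by hand, a throwaway harness like this (a hypothetical ParseDemo class, not part of jlox, assuming the Scanner and AstPrinter classes from earlier chapters are on the classpath) shows precedence and associativity at work:

package com.craftinginterpreters.lox;

import java.util.List;

// Hypothetical throwaway harness, not part of jlox.
public class ParseDemo {
  public static void main(String[] args) {
    List<Token> tokens = new Scanner("1 + 2 * 3 - 4").scanTokens();
    Expr expression = new Parser(tokens).parse();

    // "*" binds tighter than "+", and "-" associates to the left, so this
    // should print something like: (- (+ 1.0 (* 2.0 3.0)) 4.0)
    System.out.println(new AstPrinter().print(expression));
  }
}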
Challenges 挑战
- In C, a block is a statement form that allows you to pack a series of statements where a single one is expected. The comma operator is an analogous syntax for expressions. A comma-separated series of expressions can be given where a single expression is expected (except inside a function call's argument list). At runtime, the comma operator evaluates the left operand and discards the result. Then it evaluates and returns the right operand.
  在 C 语言中,块是一种语句形式,允许你在预期单个语句的地方打包一系列语句。逗号运算符是表达式的类似语法。在预期单个表达式的地方(函数调用的参数列表内部除外),可以给出一个由逗号分隔的表达式系列。在运行时,逗号运算符会计算左操作数并丢弃结果,然后计算并返回右操作数。

  Add support for comma expressions. Give them the same precedence and associativity as in C. Write the grammar, and then implement the necessary parsing code.
  添加对逗号表达式的支持。赋予它们与 C 语言中相同的优先级和结合性。编写语法,然后实现必要的解析代码。

- Likewise, add support for the C-style conditional or "ternary" operator ?:. What precedence level is allowed between the ? and :? Is the whole operator left-associative or right-associative?
  同样地,添加对 C 风格的条件或"三元"运算符 ?: 的支持。在 ? 和 : 之间允许的优先级级别是什么?整个运算符是左结合还是右结合?

- Add error productions to handle each binary operator appearing without a left-hand operand. In other words, detect a binary operator appearing at the beginning of an expression. Report that as an error, but also parse and discard a right-hand operand with the appropriate precedence.
  添加错误产生式以处理每个没有左操作数出现的二元运算符。换句话说,检测出现在表达式开头的二元运算符。将其报告为错误,但也要解析并丢弃具有适当优先级的右操作数。
Design Note: Logic Versus History
设计说明:逻辑与历史
Let’s say we decide to add bitwise &
and |
operators to Lox. Where should we
put them in the precedence hierarchy? C—and most languages that follow in C’s
footsteps—place them below ==
. This is widely considered a mistake because
it means common operations like testing a flag require parentheses.
假设我们决定在 Lox 中添加按位 &
和 |
运算符。我们应该将它们放在优先级层次结构中的哪个位置?C 语言——以及大多数追随 C 语言脚步的语言——将它们放在 ==
之下。这被广泛认为是一个错误,因为它意味着像测试标志这样的常见操作需要括号。
if (flags & FLAG_MASK == SOME_FLAG) { ... }   // Wrong.
if ((flags & FLAG_MASK) == SOME_FLAG) { ... } // Right.
Should we fix this for Lox and put bitwise operators higher up the precedence
table than C does? There are two strategies we can take.
我们应该为 Lox 修复这个问题,并将位运算符的优先级设置得比 C 语言更高吗?我们可以采取两种策略。
You almost never want to use the result of an ==
expression as the operand to
a bitwise operator. By making bitwise bind tighter, users don’t need to
parenthesize as often. So if we do that, and users assume the precedence is
chosen logically to minimize parentheses, they’re likely to infer it correctly.
几乎从不希望将 ==
表达式的结果用作位运算符的操作数。通过使位运算符绑定更紧密,用户不需要经常使用括号。因此,如果我们这样做,并且用户假设优先级是逻辑选择的以最小化括号,他们很可能会正确推断出它。
This kind of internal consistency makes the language easier to learn because
there are fewer edge cases and exceptions users have to stumble into and then
correct. That’s good, because before users can use our language, they have to
load all of that syntax and semantics into their heads. A simpler, more rational
language makes sense.
这种内部一致性使得语言更易于学习,因为用户需要遇到并纠正的边缘情况和例外更少。这是好事,因为在用户能够使用我们的语言之前,他们必须将所有语法和语义装入脑海。一个更简单、更合理的语言是有意义的。
But, for many users there is an even faster shortcut to getting our language’s
ideas into their wetware—use concepts they already know. Many newcomers to
our language will be coming from some other language or languages. If our
language uses some of the same syntax or semantics as those, there is much less
for the user to learn (and unlearn).
但是,对于许多用户来说,将我们语言的思想融入他们的“湿件”中有一个更快捷的途径——利用他们已经熟悉的概念。许多新接触我们语言的用户可能来自其他一种或多种语言背景。如果我们的语言采用了与那些语言相似的语法或语义,用户需要学习(以及摒弃)的内容就会少得多。
This is particularly helpful with syntax. You may not remember it well today,
but way back when you learned your very first programming language, code
probably looked alien and unapproachable. Only through painstaking effort did
you learn to read and accept it. If you design a novel syntax for your new
language, you force users to start that process all over again.
这对于语法特别有帮助。你可能今天记得不太清楚,但回想当初学习第一门编程语言时,代码可能看起来陌生且难以接近。只有通过艰苦的努力,你才学会阅读并接受它。如果你为你的新语言设计了一种全新的语法,你就迫使用户重新开始这个过程。
Taking advantage of what users already know is one of the most powerful tools
you can use to ease adoption of your language. It’s almost impossible to
overestimate how valuable this is. But it faces you with a nasty problem: What
happens when the thing the users all know kind of sucks? C’s bitwise operator
precedence is a mistake that doesn’t make sense. But it’s a familiar mistake
that millions have already gotten used to and learned to live with.
利用用户已有的知识是你可以用来促进语言采用的最强大工具之一。这一点的重要性几乎无法高估。但它也带来了一个棘手的问题:当用户所熟知的东西其实并不好时,该怎么办?C 语言的位运算符优先级就是一个毫无道理的错误。然而,这是一个数百万用户已经习惯并学会与之共处的熟悉错误。
Do you stay true to your language’s own internal logic and ignore history? Do
you start from a blank slate and first principles? Or do you weave your language
into the rich tapestry of programming history and give your users a leg up by
starting from something they already know?
你是否坚持自己语言的内部逻辑而忽视历史?你是否从零开始,遵循基本原则?还是你将语言编织进编程历史的丰富挂毯中,通过从用户已知的内容出发,为他们提供助力?
There is no perfect answer here, only trade-offs. You and I are obviously biased
towards liking novel languages, so our natural inclination is to burn the
history books and start our own story.
这里没有完美的答案,只有权衡。你和我显然倾向于喜欢新颖的语言,所以我们自然的倾向是烧掉历史书,开始我们自己的故事。
In practice, it’s often better to make the most of what users already know.
Getting them to come to your language requires a big leap. The smaller you can
make that chasm, the more people will be willing to cross it. But you can’t
always stick to history, or your language won’t have anything new and
compelling to give people a reason to jump over.
在实践中,充分利用用户已有的知识往往更为可取。让他们接受你的语言需要跨越巨大的鸿沟。你能够缩小的差距越小,愿意跨越它的人就越多。但你不能总是固守历史,否则你的语言将缺乏新颖和引人入胜之处,无法给人们提供跨越的理由。