27  A field guide to base R

27.1 Introduction

To finish off the programming section, we’re going to give you a quick tour of the most important base R functions that we don’t otherwise discuss in the book. These tools are particularly useful as you do more programming and will help you read code you’ll encounter in the wild.

This is a good place to remind you that the tidyverse is not the only way to solve data science problems. We teach the tidyverse in this book because tidyverse packages share a common design philosophy, increasing the consistency across functions, and making each new function or package a little easier to learn and use. It’s not possible to use the tidyverse without using base R, so we’ve actually already taught you a lot of base R functions: from library() to load packages, to sum() and mean() for numeric summaries, to the factor, date, and POSIXct data types, and of course all the basic operators like +, -, /, *, |, &, and !. What we haven’t focused on so far is base R workflows, so we will highlight a few of those in this chapter.

After you read this book, you’ll learn other approaches to the same problems using base R, data.table, and other packages. You’ll undoubtedly encounter these other approaches when you start reading R code written by others, particularly if you’re using StackOverflow. It’s 100% okay to write code that uses a mix of approaches, and don’t let anyone tell you otherwise!

In this chapter, we’ll focus on four big topics: subsetting with [, subsetting with [[ and $, the apply family of functions, and for loops. To finish off, we’ll briefly discuss two essential plotting functions.

27.1.1 Prerequisites

This chapter focuses on base R so doesn’t have any real prerequisites, but we’ll load the tidyverse in order to explain some of the differences.

27.2 Selecting multiple elements with [

[ is used to extract sub-components from vectors and data frames, and is called like x[i] or x[i, j]. In this section, we’ll introduce you to the power of [, first showing you how you can use it with vectors, then how the same principles extend in a straightforward way to two-dimensional (2d) structures like data frames. We’ll then help you cement that knowledge by showing how various dplyr verbs are special cases of [.

27.2.1 Subsetting vectors

There are five main types of things that you can subset a vector with, i.e., that can be the i in x[i]:

  1. A vector of positive integers. Subsetting with positive integers keeps the elements at those positions:

    x <- c("one", "two", "three", "four", "five")
    x[c(3, 2, 5)]
    #> [1] "three" "two"   "five"

    By repeating a position, you can actually make a longer output than input, making the term “subsetting” a bit of a misnomer.

    x[c(1, 1, 5, 5, 5, 2)]
    #> [1] "one"  "one"  "five" "five" "five" "two"
  2. A vector of negative integers. Negative values drop the elements at the specified positions:

    x[c(-1, -3, -5)]
    #> [1] "two"  "four"
  3. A logical vector. Subsetting with a logical vector keeps all values corresponding to a TRUE value. This is most often useful in conjunction with the comparison functions.

    x <- c(10, 3, NA, 5, 8, 1, NA)
    
    # All non-missing values of x
    x[!is.na(x)]
    #> [1] 10  3  5  8  1
    
    # All even (or missing!) values of x
    x[x %% 2 == 0]
    #> [1] 10 NA  8 NA

    Unlike filter(), NA indices will be included in the output as NAs.

  4. A character vector. If you have a named vector, you can subset it with a character vector:

    x <- c(abc = 1, def = 2, xyz = 5)
    x[c("xyz", "def")]
    #> xyz def 
    #>   5   2

    As with subsetting with positive integers, you can use a character vector to duplicate individual entries.

  5. Nothing. The final type of subsetting is nothing, x[], which returns the complete x. This is not useful for subsetting vectors, but as we’ll see shortly, it is useful when subsetting 2d structures like tibbles.
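
The last two cases are easy to check for yourself. Here is a minimal sketch that reuses the named vector from item 4; nothing beyond base R is assumed:

```r
x <- c(abc = 1, def = 2, xyz = 5)

# Repeating a name repeats the entry, just as repeating a position does
x[c("abc", "abc", "xyz")]
#> abc abc xyz 
#>   1   1   5

# Subsetting with nothing returns the whole vector, names included
identical(x[], x)
#> [1] TRUE
```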

27.2.2 Subsetting data frames

There are quite a few different ways that you can use [ with a data frame, but the most important way is to select rows and columns independently with df[rows, cols]. Here rows and cols are vectors as described above. For example, df[rows, ] and df[, cols] select just rows or just columns, using the empty subset to preserve the other dimension.

Here are a couple of examples:

df <- tibble(
  x = 1:3, 
  y = c("a", "e", "f"), 
  z = runif(3)
)

# Select first row and second column
df[1, 2]
#> # A tibble: 1 × 1
#>   y    
#>   <chr>
#> 1 a

# Select all rows and columns x and y
df[, c("x" , "y")]
#> # A tibble: 3 × 2
#>       x y    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 e    
#> 3     3 f

# Select rows where `x` is greater than 1 and all columns
df[df$x > 1, ]
#> # A tibble: 2 × 3
#>       x y         z
#>   <int> <chr> <dbl>
#> 1     2 e     0.834
#> 2     3 f     0.601

We’ll come back to $ shortly, but you should be able to guess what df$x does from the context: it extracts the x variable from df. We need to use it here because [ doesn’t use tidy evaluation, so you need to be explicit about the source of the x variable.
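
To see what goes wrong without the explicit prefix, here is a small sketch using a base data.frame with a hypothetical column called score (the name is ours, chosen because it is unlikely to clash with anything in your session). A bare column name is looked up in the calling environment, not inside the data frame:

```r
df <- data.frame(score = c(1, 5, 9), id = c("a", "b", "c"))

# `[` evaluates its arguments like an ordinary function call, so a bare
# column name is searched for in the environment, not in df
msg <- tryCatch(df[score > 4, ], error = function(e) conditionMessage(e))
msg
#> [1] "object 'score' not found"

# Being explicit about where score lives works as expected
df[df$score > 4, ]
#>   score id
#> 2     5  b
#> 3     9  c
```

If an object called score did happen to exist in your environment, R would silently use it instead, which is arguably worse than the error.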

There’s an important difference between tibbles and data frames when it comes to [. In this book, we’ve mainly used tibbles, which are data frames, but they tweak some behaviors to make your life a little easier. In most places, you can use “tibble” and “data frame” interchangeably, so when we want to draw particular attention to R’s built-in data frame, we’ll write data.frame. If df is a data.frame, then df[, cols] will return a vector if cols selects a single column and a data frame if it selects more than one column. If df is a tibble, then [ will always return a tibble.

df1 <- data.frame(x = 1:3)
df1[, "x"]
#> [1] 1 2 3

df2 <- tibble(x = 1:3)
df2[, "x"]
#> # A tibble: 3 × 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3

One way to avoid this ambiguity with data.frames is to explicitly specify drop = FALSE:

df1[, "x" , drop = FALSE]
#>   x
#> 1 1
#> 2 2
#> 3 3
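
This matters most inside functions, where the number of selected columns may not be known in advance. As a rough sketch (the two-column data.frame here is made up for illustration), you can check the return type directly:

```r
df1 <- data.frame(x = 1:3, y = c("a", "b", "c"))

# One column selected: the result is simplified to a vector
is.data.frame(df1[, "x"])
#> [1] FALSE

# Two columns, or one column with drop = FALSE: still a data frame
is.data.frame(df1[, c("x", "y")])
#> [1] TRUE
is.data.frame(df1[, "x", drop = FALSE])
#> [1] TRUE
```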

27.2.3 dplyr equivalents

Several dplyr verbs are special cases of [:

  • filter() is equivalent to subsetting the rows with a logical vector, taking care to exclude missing values:

    df <- tibble(
      x = c(2, 3, 1, 1, NA), 
      y = letters[1:5], 
      z = runif(5)
    )
    df |> filter(x > 1)
    
    # same as
    df[!is.na(df$x) & df$x > 1, ]

    Another common technique in the wild is to use which() for its side-effect of dropping missing values: df[which(df$x > 1), ].

  • arrange() is equivalent to subsetting the rows with an integer vector, usually created with order():

    df |> arrange(x, y)
    
    # same as
    df[order(df$x, df$y), ]

    You can use order(decreasing = TRUE) to sort all columns in descending order or -rank(col) to sort columns in decreasing order individually.

  • Both select() and relocate() are similar to subsetting the columns with a character vector:

    df |> select(x, z)
    
    # same as
    df[, c("x", "z")]
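
The role of missing values in the filter() and arrange() equivalents is easier to see on a small example. This sketch uses a base data.frame with the same x values as above, so it runs without the tidyverse:

```r
df <- data.frame(x = c(2, 3, 1, 1, NA), y = letters[1:5])

# A bare logical test keeps an all-NA row for the missing value
nrow(df[df$x > 1, ])
#> [1] 3

# which() returns only the positions that are TRUE, so the NA row is dropped
which(df$x > 1)
#> [1] 1 2
nrow(df[which(df$x > 1), ])
#> [1] 2

# order() supplies the integer row positions; missing values sort last
df[order(df$x, decreasing = TRUE), "x"]
#> [1]  3  2  1  1 NA
```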

Base R also provides a function that combines the features of filter() and select() called subset():

df |> 
  filter(x > 1) |> 
  select(y, z)
#> # A tibble: 2 × 2
#>   y           z
#>   <chr>   <dbl>
#> 1 a     0.157  
#> 2 b     0.00740
# same as
df |> subset(x > 1, c(y, z))

This function was the inspiration for much of dplyr’s syntax.
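
If you want to convince yourself that subset() is shorthand for [, here is a small sketch on a base data.frame (the z values are made up so that the output is reproducible). Like filter(), subset() drops rows where the condition is NA:

```r
df <- data.frame(x = c(2, 3, 1, 1, NA), y = letters[1:5], z = 5:1)

by_subset <- subset(df, x > 1, c(y, z))
by_bracket <- df[!is.na(df$x) & df$x > 1, c("y", "z")]

by_subset
#>   y z
#> 1 a 5
#> 2 b 4

# Both approaches select the same rows and columns
all.equal(by_subset, by_bracket)
#> [1] TRUE
```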

27.2.4 Exercises

  1. Create functions that take a vector as input and return:

    1. The elements at even-numbered positions.
    2. Every element except the last value.
    3. Only even values (and no missing values).
  2. Why is x[-which(x > 0)] not the same as x[x <= 0]? Read the documentation for which() and do some experiments to figure it out.

27.3 Selecting a single element with $ and [[

[, which selects many elements, is paired with [[ and $, which extract a single element. In this section, we’ll show you how to use [[ and $ to pull columns out of data frames, discuss a couple more differences between data.frames and tibbles, and emphasize some important differences between [ and [[ when used with lists.

27.3.1 Data frames

[[ and $ can be used to extract columns out of a data frame. [[ can access by position or by name, and $ is specialized for access by name:

tb <- tibble(
  x = 1:4,
  y = c(10, 4, 1, 21)
)

# by position
tb[[1]]
#> [1] 1 2 3 4

# by name
tb[["x"]]
#> [1] 1 2 3 4
tb$x
#> [1] 1 2 3 4

They can also be used to create new columns, the base R equivalent of mutate():

tb$z <- tb$x + tb$y
tb
#> # A tibble: 4 × 3
#>       x     y     z
#>   <int> <dbl> <dbl>
#> 1     1    10    11
#> 2     2     4     6
#> 3     3     1     4
#> 4     4    21    25

There are several other base R approaches to creating new columns including with transform(), with(), and within(). Hadley collected a few examples at https://gist.github.com/hadley/1986a273e384fb2d4d752c18ed71bedf.
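
As a quick taste of those three functions, here is a minimal sketch on a base data.frame with the same x and y values as tb. It is only meant to show the shape of each call, not to reproduce the examples in the gist:

```r
df <- data.frame(x = 1:4, y = c(10, 4, 1, 21))

# transform() returns a copy of df with the new column added
transform(df, z = x + y)
#>   x  y  z
#> 1 1 10 11
#> 2 2  4  6
#> 3 3  1  4
#> 4 4 21 25

# within() evaluates an assignment inside df and also returns a modified copy
df_within <- within(df, z <- x + y)

# with() evaluates an expression using df's columns and returns the value
# itself, not a data frame
with(df, x + y)
#> [1] 11  6  4 25
```

None of these modify df in place; you need to assign the result if you want to keep it.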

Using $ directly is convenient when performing quick summaries. For example, if you just want to find the size of the biggest diamond or the possible values of cut, there’s no need to use summarize():

max(diamonds$carat)
#> [1] 5.01

levels(diamonds$cut)
#> [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"

dplyr also provides an equivalent to [[/$ that we didn’t mention earlier in the book: pull(). pull() takes either a variable name or variable position and returns just that column. That means we could rewrite the above code to use the pipe:

diamonds |> pull(carat) |> max()
#> [1] 5.01

diamonds |> pull(cut) |> levels()
#> [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"

27.3.2 Tibbles

There are a couple of important differences between tibbles and base data.frames when it comes to $. Data frames match the prefix of any variable names (so-called partial matching) and don’t complain if a column doesn’t exist:

df <- data.frame(x1 = 1)
df$x
#> [1] 1
df$z
#> NULL

Tibbles are more strict: they only ever match variable names exactly and they will generate a warning if the column you are trying to access doesn’t exist:

tb <- tibble(x1 = 1)

tb$x
#> Warning: Unknown or uninitialised column: `x`.
#> NULL
tb$z
#> Warning: Unknown or uninitialised column: `z`.
#> NULL

For this reason we sometimes joke that tibbles are lazy and surly: they do less and complain more.
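
If you are stuck with a base data.frame and want to avoid partial matching, one option is to use [[ instead of $. The sketch below relies on the fact that [[ matches names exactly by default:

```r
df <- data.frame(x1 = 1)

# $ partially matches "x" to the column x1
df$x
#> [1] 1

# [[ requires an exact match unless you opt out
df[["x"]]
#> NULL
df[["x", exact = FALSE]]
#> [1] 1
```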

27.3.3 Lists

[[ and $ are also really important for working with lists, and it’s important to understand how they differ from [. Let’s illustrate the differences with a list named l:

l <- list(
  a = 1:3, 
  b = "a string", 
  c = pi, 
  d = list(-1, -5)
)
  • [ extracts a sub-list. It doesn’t matter how many elements you extract, the result will always be a list.

    str(l[1:2])
    #> List of 2
    #>  $ a: int [1:3] 1 2 3
    #>  $ b: chr "a string"
    
    str(l[1])
    #> List of 1
    #>  $ a: int [1:3] 1 2 3
    
    str(l[4])
    #> List of 1
    #>  $ d:List of 2
    #>   ..$ : num -1
    #>   ..$ : num -5

    Like with vectors, you can subset with a logical, integer, or character vector.

  • [[ and $ extract a single component from a list. They remove a level of hierarchy from the list.

    str(l[[1]])
    #>  int [1:3] 1 2 3
    
    str(l[[4]])
    #> List of 2
    #>  $ : num -1
    #>  $ : num -5
    
    str(l$a)
    #>  int [1:3] 1 2 3

The difference between [ and [[ is particularly important for lists because [[ drills down into the list while [ returns a new, smaller list. To help you remember the difference, take a look at the unusual pepper shaker shown in Figure 27.1. If this pepper shaker is your list pepper, then pepper[1] is a pepper shaker containing a single pepper packet. pepper[2] would look the same, but would contain the second packet. pepper[1:2] would be a pepper shaker containing two pepper packets. pepper[[1]] would extract the pepper packet itself.

Three photos. On the left is a photo of a glass pepper shaker. Instead of the pepper shaker containing pepper, it contains a single packet of pepper. In the middle is a photo of a single packet of pepper. On the right is a photo of the contents of a packet of pepper.
Figure 27.1: (Left) A pepper shaker that Hadley once found in his hotel room. (Middle) pepper[1]. (Right) pepper[[1]]

This same principle applies when you use 1d [ with a data frame: df["x"] returns a one-column data frame and df[["x"]] returns a vector.
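
One way to keep the two straight is to look at the class of the result. This sketch uses a small made-up list and data.frame:

```r
l <- list(a = 1:3, b = "a string")
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# [ keeps the container: a list stays a list, a data frame stays a data frame
class(l["a"])
#> [1] "list"
class(df["x"])
#> [1] "data.frame"

# [[ reaches inside and returns the element itself
class(l[["a"]])
#> [1] "integer"
class(df[["x"]])
#> [1] "integer"
```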

27.3.4 Exercises

  1. What happens when you use [[ with a positive integer that’s bigger than the length of the vector? What happens when you subset with a name that doesn’t exist?

  2. What would pepper[[1]][1] be? What about pepper[[1]][[1]]?

27.4 Apply family

Earlier in the book, you learned tidyverse techniques for iteration like dplyr::across() and the map family of functions. In this section, you’ll learn about their base equivalents, the apply family. In this context apply and map are synonyms because another way of saying “map a function over each element of a vector” is “apply a function over each element of a vector”. Here we’ll give you a quick overview of this family so you can recognize them in the wild.

The most important member of this family is lapply(), which is very similar to purrr::map(). In fact, because we haven’t used any of map()’s more advanced features, you can replace every map() call from earlier in the book with lapply().

There’s no exact base R equivalent to across() but you can get close by using [ with lapply(). This works because under the hood, data frames are lists of columns, so calling lapply() on a data frame applies the function to each column.

df <- tibble(a = 1, b = 2, c = "a", d = "b", e = 4)

# First find numeric columns
num_cols <- sapply(df, is.numeric)
num_cols
#>     a     b     c     d     e 
#>  TRUE  TRUE FALSE FALSE  TRUE

# Then transform each column with lapply() then replace the original values
df[, num_cols] <- lapply(df[, num_cols, drop = FALSE], \(x) x * 2)
df
#> # A tibble: 1 × 5
#>       a     b c     d         e
#>   <dbl> <dbl> <chr> <chr> <dbl>
#> 1     2     4 a     b         8

The code above uses a new function, sapply(). It’s similar to lapply() but it always tries to simplify the result, hence the s in its name, here producing a logical vector instead of a list. We don’t recommend using it for programming, because the simplification can fail and give you an unexpected type, but it’s usually fine for interactive use. purrr has a similar function called map_vec() that we didn’t mention earlier.

Base R provides a stricter version of sapply() called vapply(), short for vector apply. It takes an additional argument that specifies the expected type, ensuring that simplification occurs the same way regardless of the input. For example, we could replace the sapply() call above with this vapply() where we specify that we expect is.numeric() to return a logical vector of length 1:

vapply(df, is.numeric, logical(1))
#>     a     b     c     d     e 
#>  TRUE  TRUE FALSE FALSE  TRUE

The distinction between sapply() and vapply() is really important when they’re inside a function (because it makes a big difference to the function’s robustness to unusual inputs), but it doesn’t usually matter in data analysis.
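
A zero-length input is one simple case where the two differ. The sketch below shows sapply() falling back to an empty list while vapply() still returns the type you asked for, and it also shows vapply() failing loudly when a result has the wrong shape:

```r
empty <- list()

# sapply() has nothing to simplify, so you get a list instead of a logical vector
sapply(empty, is.numeric)
#> list()

# vapply() returns the declared type, even for empty input
vapply(empty, is.numeric, logical(1))
#> logical(0)

# vapply() throws an error if a result doesn't match the declared shape:
# range() returns two values, but we promised one
failed <- tryCatch(
  {
    vapply(list(1:3), range, numeric(1))
    FALSE
  },
  error = function(e) TRUE
)
failed
#> [1] TRUE
```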

Another important member of the apply family is tapply() which computes a single grouped summary:

diamonds |> 
  group_by(cut) |> 
  summarize(price = mean(price))
#> # A tibble: 5 × 2
#>   cut       price
#>   <ord>     <dbl>
#> 1 Fair      4359.
#> 2 Good      3929.
#> 3 Very Good 3982.
#> 4 Premium   4584.
#> 5 Ideal     3458.

tapply(diamonds$price, diamonds$cut, mean)
#>      Fair      Good Very Good   Premium     Ideal 
#>  4358.758  3928.864  3981.760  4584.258  3457.542

Unfortunately tapply() returns its results in a named vector which requires some gymnastics if you want to collect multiple summaries and grouping variables into a data frame (it’s certainly possible to not do this and just work with free floating vectors, but in our experience that just delays the work). If you want to see how you might use tapply() or other base techniques to perform other grouped summaries, Hadley has collected a few techniques in a gist.
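
To give a flavor of those gymnastics, here is a minimal sketch on a made-up sales data.frame (the column names and values are ours, not from the gist). It rebuilds a data frame from the tapply() result by hand; aggregate() from the stats package is one base alternative that returns a data frame directly:

```r
sales <- data.frame(
  region = c("east", "west", "east", "west", "east"),
  amount = c(10, 20, 30, 40, 80)
)

avg <- tapply(sales$amount, sales$region, mean)
avg
#> east west 
#>   40   30

# The group labels live in the names, so we have to pull them out ourselves
summary_df <- data.frame(region = names(avg), amount = as.vector(avg))
summary_df
#>   region amount
#> 1   east     40
#> 2   west     30

# aggregate() should produce the same table in one step
by_aggregate <- aggregate(amount ~ region, data = sales, FUN = mean)
```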

The final member of the apply family is the titular apply(), which works with matrices and arrays. In particular, watch out for apply(df, 2, something), which is a slow and potentially dangerous way of doing lapply(df, something). This rarely comes up in data science because we usually work with data frames and not matrices.
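
The danger comes from apply() converting the data frame to a matrix first, which forces every column to a single type. A small sketch with a made-up two-column data.frame:

```r
df <- data.frame(a = 1:2, b = c("x", "y"))

# apply() coerces df to a character matrix, so no column looks numeric any more
apply(df, 2, is.numeric)
#>     a     b 
#> FALSE FALSE

# Working column by column keeps each column's own type
vapply(df, is.numeric, logical(1))
#>     a     b 
#>  TRUE FALSE
```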

27.5 for loops

for loops are the fundamental building block of iteration that both the apply and map families use under the hood. for loops are powerful and general tools that are important to learn as you become a more experienced R programmer. The basic structure of a for loop looks like this:

for (element in vector) {
  # do something with element
}

The most straightforward use of for loops is to achieve the same effect as walk(): call some function with a side-effect on each element of a list. For example, instead of using walk():

paths |> walk(append_file)

We could have used a for loop:

for (path in paths) {
  append_file(path)
}

Things get a little trickier if you want to save the output of the for loop, for example reading all of the Excel files in a directory like we did earlier in the book:

paths <- dir("data/gapminder", pattern = "\\.xlsx$", full.names = TRUE)
files <- map(paths, readxl::read_excel)

There are a few different techniques that you can use, but we recommend being explicit about what the output is going to look like upfront. In this case, we’re going to want a list the same length as paths, which we can create with vector():

files <- vector("list", length(paths))

Then instead of iterating over the elements of paths, we’ll iterate over their indices, using seq_along() to generate one index for each element of paths:

seq_along(paths)
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12

Using the indices is important because it allows us to link to each position in the input with the corresponding position in the output:

for (i in seq_along(paths)) {
  files[[i]] <- readxl::read_excel(paths[[i]])
}

To combine the list of tibbles into a single tibble you can use do.call() + rbind():

do.call(rbind, files)
#> # A tibble: 1,704 × 5
#>   country     continent lifeExp      pop gdpPercap
#>   <chr>       <chr>       <dbl>    <dbl>     <dbl>
#> 1 Afghanistan Asia         28.8  8425333      779.
#> 2 Albania     Europe       55.2  1282697     1601.
#> 3 Algeria     Africa       43.1  9279525     2449.
#> 4 Angola      Africa       30.0  4232095     3521.
#> 5 Argentina   Americas     62.5 17876956     5911.
#> 6 Australia   Oceania      69.1  8691212    10040.
#> # ℹ 1,698 more rows
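
If you don’t have the gapminder spreadsheets to hand, the same three steps (allocate, fill by index, combine) can be tried on made-up inputs. In this sketch each “file” is replaced by a tiny data.frame built from a number, which is purely our stand-in for read_excel():

```r
inputs <- c(1, 4, 9)

# 1. Allocate a list with one slot per input
out <- vector("list", length(inputs))

# 2. Fill each slot by index
for (i in seq_along(inputs)) {
  out[[i]] <- data.frame(input = inputs[[i]], root = sqrt(inputs[[i]]))
}

# 3. Combine the pieces into one data frame
combined <- do.call(rbind, out)
combined
#>   input root
#> 1     1    1
#> 2     4    2
#> 3     9    3

# seq_along() is also safe for empty inputs, unlike 1:length(x)
seq_along(character())
#> integer(0)
1:length(character())
#> [1] 1 0
```

With 1:length(x), a loop over an empty vector would run twice, with i equal to 1 and then 0, which is rarely what you want.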

Rather than making a list and saving the results as we go, a simpler approach is to build up the data frame piece-by-piece:

out <- NULL
for (path in paths) {
  out <- rbind(out, readxl::read_excel(path))
}

We recommend avoiding this pattern because it can become very slow when the vector is very long. This is the source of the persistent canard that for loops are slow: they’re not, but iteratively growing a vector is.
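
The same trade-off applies to plain vectors. The sketch below builds the same result two ways, growing and pre-allocating; the size n is arbitrary and we deliberately don’t quote timings, since they vary by machine. You can wrap either loop in system.time() to compare them yourself:

```r
n <- 1e4

# Growing: each iteration copies everything accumulated so far
grow <- NULL
for (i in seq_len(n)) {
  grow <- c(grow, i^2)
}

# Pre-allocating: the vector is created once and filled in place
pre <- numeric(n)
for (i in seq_len(n)) {
  pre[[i]] <- i^2
}

identical(grow, pre)
#> [1] TRUE
```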

27.6 Plots

Many R users who don’t otherwise use the tidyverse prefer ggplot2 for plotting due to helpful features like sensible defaults, automatic legends, and a modern look. However, base R plotting functions can still be useful because they’re so concise — it takes very little typing to do a basic exploratory plot.

There are two main types of base plot you’ll see in the wild: scatterplots and histograms, produced with plot() and hist() respectively. Here’s a quick example from the diamonds dataset:

# Left
hist(diamonds$carat)

# Right
plot(diamonds$carat, diamonds$price)

On the left, histogram of carats of diamonds, ranging from 0 to 5 carats. The distribution is unimodal and right-skewed. On the right, scatter plot of price vs. carat of diamonds, showing a positive relationship that fans out as both price and carat increases. The scatter plot shows very few diamonds bigger than 3 carats compared to diamonds between 0 to 3 carats.

Note that base plotting functions work with vectors, so you need to pull columns out of the data frame using $ or some other technique.
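
Two other techniques you may run into are with() and the formula interface to plot(). The sketch below uses the built-in mtcars dataset rather than diamonds so that it runs without any packages:

```r
# $ pulls each column out as a vector
hist(mtcars$mpg)
plot(mtcars$wt, mtcars$mpg)

# with() lets you refer to the columns by name inside the call
with(mtcars, plot(wt, mpg))

# The formula interface reads as "mpg as a function of wt"
plot(mpg ~ wt, data = mtcars)
```

All three plot() calls draw the same scatterplot; only the way the columns are supplied differs.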

27.7 Summary

In this chapter, we’ve shown you a selection of base R functions useful for subsetting and iteration. Compared to approaches discussed elsewhere in the book, these functions tend to have more of a “vector” flavor than a “data frame” flavor because base R functions tend to take individual vectors, rather than a data frame and some column specification. This often makes life easier for programming and so becomes more important as you write more functions and begin to write your own packages.

This chapter concludes the programming section of the book. You’ve made a solid start on your journey to becoming not just a data scientist who uses R, but a data scientist who can program in R. We hope these chapters have sparked your interest in programming and that you’re looking forward to learning more outside of this book.


  1. Read https://adv-r.hadley.nz/subsetting.html#subset-multiple to see how you can also subset a data frame like it is a 1d object and how you can subset it with a matrix.

  2. But it doesn’t handle grouped data frames differently and it doesn’t support selection helper functions like starts_with().

  3. It just lacks convenient features like progress bars and reporting which element caused the problem if there’s an error.