There are four important verbs that affect the columns without changing the rows: mutate(), select(), rename(), and relocate(). mutate() creates new columns that are functions of the existing columns; select(), rename(), and relocate() change which columns are present, their names, or their positions.
mutate()
The job of mutate() is to add new columns that are calculated from the existing columns. In the transform chapters, you’ll learn a large set of functions that you can use to manipulate different types of variables. For now, we’ll stick with basic algebra, which allows us to compute the gain, how much time a delayed flight made up in the air, and the speed in miles per hour:
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60
  )
#> # A tibble: 336,776 × 21
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 336,770 more rows, 11 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>, and abbreviated
#> #   variable names ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time,
#> #   ⁵arr_delay
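Because the new columns land at the far right of a wide dataset, it can help to check the arithmetic on a tiny table first. The following is a minimal sketch, assuming dplyr is installed; `toy` and its values are made up for illustration and are not rows from flights:

```r
library(dplyr)

# Hypothetical toy data: two invented flights, not taken from `flights`
toy <- tibble(
  dep_delay = c(10, -5),    # minutes late at departure
  arr_delay = c(4, 5),      # minutes late at arrival
  distance  = c(500, 1000), # miles
  air_time  = c(100, 120)   # minutes in the air
)

toy_result <- toy |>
  mutate(
    gain = dep_delay - arr_delay,     # minutes made up in the air
    speed = distance / air_time * 60  # miles per minute, scaled to miles per hour
  )

toy_result$gain   # 6 and -10
toy_result$speed  # approximately 300 and 500
```

A positive gain means the flight made up time in the air; a negative gain means it lost time.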
 
By default, mutate() adds new columns on the right hand side of your dataset, which makes it difficult to see what’s happening here. (Remember that in RStudio, the easiest way to see a dataset with many columns is View().) We can use the .before argument to instead add the variables to the left hand side:
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .before = 1
  )
#> # A tibble: 336,776 × 21
#>    gain speed  year month   day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
#>   <dbl> <dbl> <int> <int> <int>    <int>        <int>   <dbl>   <int>   <int>
#> 1    -9  370.  2013     1     1      517          515       2     830     819
#> 2   -16  374.  2013     1     1      533          529       4     850     830
#> 3   -31  408.  2013     1     1      542          540       2     923     850
#> 4    17  517.  2013     1     1      544          545      -1    1004    1022
#> 5    19  394.  2013     1     1      554          600      -6     812     837
#> 6   -16  288.  2013     1     1      554          558      -4     740     728
#> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#> #   ²dep_delay, ³arr_time, ⁴sched_arr_time
 
The . is a sign that .before is an argument to the function, not the name of a new variable. You can also use .after to add the new columns after a variable, and in both .before and .after you can use a variable name instead of a position. For example, we could add the new variables after day:
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    speed = distance / air_time * 60,
    .after = day
  )
#> # A tibble: 336,776 × 21
#>    year month   day  gain speed dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
#>   <int> <int> <int> <dbl> <dbl>    <int>        <int>   <dbl>   <int>   <int>
#> 1  2013     1     1    -9  370.      517          515       2     830     819
#> 2  2013     1     1   -16  374.      533          529       4     850     830
#> 3  2013     1     1   -31  408.      542          540       2     923     850
#> 4  2013     1     1    17  517.      544          545      -1    1004    1022
#> 5  2013     1     1    19  394.      554          600      -6     812     837
#> 6  2013     1     1   -16  288.      554          558      -4     740     728
#> # … with 336,770 more rows, 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#> #   ²dep_delay, ³arr_time, ⁴sched_arr_time
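To see that a position and a variable name are interchangeable in .before, here is a small sketch on a made-up one-row tibble (`toy` is hypothetical, and dplyr is assumed to be installed):

```r
library(dplyr)

# Hypothetical one-row table; dep_delay is the 4th column
toy <- tibble(year = 2013, month = 1, day = 1, dep_delay = 2, arr_delay = 11)

# Place the new column before dep_delay, referring to it by name ...
by_name <- toy |>
  mutate(gain = dep_delay - arr_delay, .before = dep_delay)

# ... or by its position
by_position <- toy |>
  mutate(gain = dep_delay - arr_delay, .before = 4)

names(by_name)
# "year" "month" "day" "gain" "dep_delay" "arr_delay"
identical(names(by_name), names(by_position))
# TRUE
```

Names are usually the safer choice, because a position silently points at a different column if the input changes shape.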
 
Alternatively, you can control which variables are kept with the .keep argument. A particularly useful value is "used", which keeps only the new columns and the columns used to compute them, so you can see the inputs and outputs of your calculations side by side:
flights |> 
  mutate(
    gain = dep_delay - arr_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours,
    .keep = "used"
  )
#> # A tibble: 336,776 × 6
#>   dep_delay arr_delay air_time  gain hours gain_per_hour
#>       <dbl>     <dbl>    <dbl> <dbl> <dbl>         <dbl>
#> 1         2        11      227    -9  3.78         -2.38
#> 2         4        20      227   -16  3.78         -4.23
#> 3         2        33      160   -31  2.67        -11.6 
#> 4        -1       -18      183    17  3.05          5.57
#> 5        -6       -25      116    19  1.93          9.83
#> 6        -4        12      150   -16  2.5          -6.4 
#> # … with 336,770 more rows
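Besides "used", the .keep argument of mutate() also accepts "all" (the default), "unused", and "none". The sketch below compares the four on a made-up one-row tibble; `toy` is hypothetical and dplyr is assumed to be installed:

```r
library(dplyr)

# Hypothetical one-row table with two inputs and two unrelated columns
toy <- tibble(dep_delay = 2, arr_delay = 11, air_time = 227, carrier = "UA")

keep_all    <- toy |> mutate(gain = dep_delay - arr_delay, .keep = "all")
keep_used   <- toy |> mutate(gain = dep_delay - arr_delay, .keep = "used")
keep_unused <- toy |> mutate(gain = dep_delay - arr_delay, .keep = "unused")
keep_none   <- toy |> mutate(gain = dep_delay - arr_delay, .keep = "none")

names(keep_all)     # every original column plus gain
names(keep_used)    # dep_delay, arr_delay, gain
names(keep_unused)  # air_time, carrier, gain
names(keep_none)    # gain
```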
 
select()
It’s not uncommon to get datasets with hundreds or even thousands of variables. In this situation, the first challenge is often just focusing on the variables you’re interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. select() is not terribly useful with the flights data because we only have 19 variables, but you can still get the general idea of how it works:
# Select columns by name
flights |> 
  select(year, month, day)
#> # A tibble: 336,776 × 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
flights |> 
  select(year:day)
#> # A tibble: 336,776 × 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
flights |> 
  select(!year:day)
#> # A tibble: 336,776 × 16
#>   dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum
#>      <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>  
#> 1      517         515       2     830     819      11 UA        1545 N14228 
#> 2      533         529       4     850     830      20 UA        1714 N24211 
#> 3      542         540       2     923     850      33 AA        1141 N619AA 
#> 4      544         545      -1    1004    1022     -18 B6         725 N804JB 
#> 5      554         600      -6     812     837     -25 DL         461 N668DN 
#> 6      554         558      -4     740     728      12 UA        1696 N39463 
#> # … with 336,770 more rows, 7 more variables: origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
#> #   ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
# Select all columns that are characters
flights |> 
  select(where(is.character))
#> # A tibble: 336,776 × 4
#>   carrier tailnum origin dest 
#>   <chr>   <chr>   <chr>  <chr>
#> 1 UA      N14228  EWR    IAH  
#> 2 UA      N24211  LGA    IAH  
#> 3 AA      N619AA  JFK    MIA  
#> 4 B6      N804JB  JFK    BQN  
#> 5 DL      N668DN  LGA    ATL  
#> 6 UA      N39463  EWR    ORD  
#> # … with 336,770 more rows
 
There are a number of helper functions you can use within select():
- starts_with("abc"): matches names that begin with “abc”.
- ends_with("xyz"): matches names that end with “xyz”.
- contains("ijk"): matches names that contain “ijk”.
- num_range("x", 1:3): matches x1, x2 and x3.
See ?select for more details. Once you know regular expressions (the topic of #chp-regexps) you’ll also be able to use matches() to select variables that match a pattern.
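The helpers are easiest to understand by trying them on a handful of columns. The following sketch uses a made-up tibble (`toy` and its column values are hypothetical) and assumes dplyr is installed:

```r
library(dplyr)

# Hypothetical table whose column names exercise each helper
toy <- tibble(
  x1 = 1, x2 = 2, x3 = 3,
  dep_time = 517, arr_time = 830, dep_delay = 2
)

begins   <- toy |> select(starts_with("dep"))   # dep_time, dep_delay
ends     <- toy |> select(ends_with("time"))    # dep_time, arr_time
has_us   <- toy |> select(contains("_"))        # dep_time, arr_time, dep_delay
numbered <- toy |> select(num_range("x", 1:2))  # x1, x2

names(begins)
names(numbered)
```

Each helper returns the matching columns in the order they appear in the data.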
You can rename variables as you select() them by using =. The new name appears on the left hand side of the =, and the old variable appears on the right hand side:
flights |> 
  select(tail_num = tailnum)
#> # A tibble: 336,776 × 1
#>   tail_num
#>   <chr>   
#> 1 N14228  
#> 2 N24211  
#> 3 N619AA  
#> 4 N804JB  
#> 5 N668DN  
#> 6 N39463  
#> # … with 336,770 more rows
 
rename()
If you want to keep all the existing variables and rename only a few, you can use rename() instead of select():
flights |> 
  rename(tail_num = tailnum)
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
#>   <int> <int> <int>    <int>    <int>   <dbl>   <int>   <int>   <dbl> <chr>  
#> 1  2013     1     1      517      515       2     830     819      11 UA     
#> 2  2013     1     1      533      529       4     850     830      20 UA     
#> 3  2013     1     1      542      540       2     923     850      33 AA     
#> 4  2013     1     1      544      545      -1    1004    1022     -18 B6     
#> 5  2013     1     1      554      600      -6     812     837     -25 DL     
#> 6  2013     1     1      554      558      -4     740     728      12 UA     
#> # … with 336,770 more rows, 9 more variables: flight <int>, tail_num <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
 
It works exactly the same way as select(), but keeps all the variables that aren’t explicitly selected.
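The difference between the two verbs is easy to check on a small made-up tibble. This is a minimal sketch, assuming dplyr is installed; `toy` is hypothetical:

```r
library(dplyr)

# Hypothetical three-column table
toy <- tibble(year = 2013, tailnum = "N14228", carrier = "UA")

selected <- toy |> select(tail_num = tailnum)  # renames and drops everything else
renamed  <- toy |> rename(tail_num = tailnum)  # renames and keeps everything else

names(selected)
# "tail_num"
names(renamed)
# "year" "tail_num" "carrier"
```

Note that rename() leaves the renamed column in its original position.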
If you have a bunch of inconsistently named columns and it would be painful to fix them all by hand, check out janitor::clean_names(), which provides some useful automated cleaning.
relocate()
Use relocate() to move variables around. You might want to collect related variables together or move important variables to the front. By default relocate() moves variables to the front:
flights |> 
  relocate(time_hour, air_time)
#> # A tibble: 336,776 × 19
#>   time_hour           air_time  year month   day dep_time sched_dep…¹ dep_d…²
#>   <dttm>                 <dbl> <int> <int> <int>    <int>       <int>   <dbl>
#> 1 2013-01-01 05:00:00      227  2013     1     1      517         515       2
#> 2 2013-01-01 05:00:00      227  2013     1     1      533         529       4
#> 3 2013-01-01 05:00:00      160  2013     1     1      542         540       2
#> 4 2013-01-01 05:00:00      183  2013     1     1      544         545      -1
#> 5 2013-01-01 06:00:00      116  2013     1     1      554         600      -6
#> 6 2013-01-01 05:00:00      150  2013     1     1      554         558      -4
#> # … with 336,770 more rows, 11 more variables: arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, and abbreviated variable names ¹sched_dep_time, ²dep_delay
 
But you can use the same .before and .after arguments as mutate() to choose where to put them:
flights |> 
  relocate(year:dep_time, .after = time_hour)
#> # A tibble: 336,776 × 19
#>   sched…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier flight tailnum origin dest 
#>     <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr>  <chr>
#> 1     515       2     830     819      11 UA        1545 N14228  EWR    IAH  
#> 2     529       4     850     830      20 UA        1714 N24211  LGA    IAH  
#> 3     540       2     923     850      33 AA        1141 N619AA  JFK    MIA  
#> 4     545      -1    1004    1022     -18 B6         725 N804JB  JFK    BQN  
#> 5     600      -6     812     837     -25 DL         461 N668DN  LGA    ATL  
#> 6     558      -4     740     728      12 UA        1696 N39463  EWR    ORD  
#> # … with 336,770 more rows, 9 more variables: air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, year <int>,
#> #   month <int>, day <int>, dep_time <int>, and abbreviated variable names
#> #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
flights |> 
  relocate(starts_with("arr"), .before = dep_time)
#> # A tibble: 336,776 × 19
#>    year month   day arr_time arr_de…¹ dep_t…² sched…³ dep_d…⁴ sched…⁵ carrier
#>   <int> <int> <int>    <int>    <dbl>   <int>   <int>   <dbl>   <int> <chr>  
#> 1  2013     1     1      830       11     517     515       2     819 UA     
#> 2  2013     1     1      850       20     533     529       4     830 UA     
#> 3  2013     1     1      923       33     542     540       2     850 AA     
#> 4  2013     1     1     1004      -18     544     545      -1    1022 B6     
#> 5  2013     1     1      812      -25     554     600      -6     837 DL     
#> 6  2013     1     1      740       12     554     558      -4     728 UA     
#> # … with 336,770 more rows, 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹arr_delay, ²dep_time, ³sched_dep_time, ⁴dep_delay, ⁵sched_arr_time
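With 19 columns the printed output makes the new order hard to follow, so here is the same idea on a made-up five-column tibble. This is a sketch, assuming dplyr is installed; `toy` is hypothetical:

```r
library(dplyr)

# Hypothetical five-column table
toy <- tibble(
  year = 2013, dep_time = 517, arr_time = 830, arr_delay = 11, carrier = "UA"
)

to_front <- toy |> relocate(carrier)
to_end   <- toy |> relocate(year, .after = carrier)
grouped  <- toy |> relocate(starts_with("arr"), .before = dep_time)

names(to_front)
# "carrier" "year" "dep_time" "arr_time" "arr_delay"
names(to_end)
# "dep_time" "arr_time" "arr_delay" "carrier" "year"
names(grouped)
# "year" "arr_time" "arr_delay" "dep_time" "carrier"
```

In every case relocate() only changes positions: no columns are added, dropped, or renamed.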